Abstract

The Test of Equality between Subsets of Coefficients in Two
Regressions is developed and applied as a means to pre-screen
variables from a regression model.

Some criteria for selection of variables are discussed, and
some existing regression packages are applied to data on
characteristics of avionics equipment for comparison purposes.
CRITERION FOR SELECTION OF VARIABLES IN
A REGRESSION ANALYSIS

I. Introduction

Previous  Results 

At the request of the Systems Evaluation Branch of the Air
Force Avionics Laboratory at Wright-Patterson AFB, the Westinghouse
Electric Corporation performed a regression analysis of
characteristics of Line Replaceable Avionics Units (LRU) in an
attempt to model some of the equipment's logistics characteristics.
Westinghouse identified 21 independent and six dependent variables.
Seven of the 21 independent variables are purely indicators,
identifying type of aircraft and general usage category of the equipment.

The regression was performed using the "Linear Least Squares
Curve Fitting Program" (LLSCFP) developed by Daniel and Wood. In
using LLSCFP, all 21 independent variables, as well as some terms
containing the squares or the natural logarithms of variables, are
included simultaneously in the first regression. Then, with the aid of
statistics, plots, and tabular data arrangement, a subset collection
of independent variables is chosen which best approximates the data.


Scope  of  Present  Study 


There are several ways to arrive at the "best" subset of
independent variables to include in a model. At one extreme, the model
can consist of all possible variables. But this is not desirable for two
reasons. First, it may be very expensive to gather and maintain a
data base for a large number of variables, some of which may have
little impact. Second, and more important, when the purpose is to
predict future costs, as it is in this study, a model that uses a large
number of variables to fit the nuances of previous data may in fact have
a higher prediction variance than a subset model (Ref 17:7). Important
information could be lost in the myriad interrelationships that exist.
For this reason the selection of the form and the variables of the
regression equation becomes important.

The methods of selection of variables to be investigated in this
study are: iterative techniques using the Statistical Package for the
Social Sciences (SPSS) and BMD Biomedical Computer Programs
(BMD), all possible regressions using the Leaps and Bounds technique
developed by Furnival and Wilson, and the $C_p$ statistic search using
LLSCFP developed by Daniel and Wood.

The desired final outcome of this analysis is to provide the
personnel at the Air Force Avionics Laboratory with a method of
performing a quality regression without an extensive background in the
technique, so that they can do work in-house which they previously
contracted out. For that reason the methods used in this work will
rely on existing packages where possible.

Theory  of  Linear  Least  Squares  Regression 

Assumptions. The first assumption made in the use of linear
least squares regression is that the correct model has been chosen.
If an incorrect form is used, some values given by the equation will
be biased.

The  second  assumption  is  that  the  data  is  typical  of  the  true 
population  about  which  the  analysis  is  being  performed. 

The third assumption of the method is that the y observations
are statistically uncorrelated and independent. If each y value is
considered to be composed of a true value and a random error value called
$\epsilon$, then this assumption can be restated as: The expected value of
the product of any two of the random components is zero.

Three other assumptions that are considered less important
(Ref 8:8) are that all observations on y have the same unknown
variance, that the levels of the independent variables are non-stochastic,
and that the uncontrolled error is distributed normally.

Method of Least Squares. The general form of the linear least
squares regression equation is

$$y = \alpha_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + \epsilon \tag{1}$$

where $y$ is the observed value of the dependent variable, $x_i$ is the
observed value of the $i$th independent variable, and $\alpha_0$ and
$\beta_i$ are regression coefficients. While the linear least squares
method can treat only equations in the form of equation (1), there are
non-linear equations which are intrinsically linear, such as
$y = \alpha_0 x_1^{\beta_1} x_2^{\beta_2}$. Taking the natural log, this
equation becomes

$$\ln y = \ln \alpha_0 + \beta_1 \ln x_1 + \beta_2 \ln x_2$$

This is only one of a variety of intrinsically linear equations and
methods for linearizing. Equations that are not intrinsically linear
cannot be handled using the linear least squares method. Non-linear
least squares packages are available but will not be considered in this
study.


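If there are N independent observations of $y_i$ and $x_i$, each
observation can be written as

$$y_i = \alpha_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_K x_{iK} + \epsilon_i$$

where $x_{ij}$ represents the $i$th observation of the $j$th variable. If the
matrices

$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1K} \\ 1 & x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & & & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{NK} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \alpha_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{bmatrix}, \quad
\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix}$$

are defined, then all N equations can be written simultaneously as

$$Y = X\beta + \epsilon \tag{7}$$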


The basis of the least squares method is that a straight line is
generated through data points in such a way that the sum of the squared
distances or errors between the line and the points is minimized. This
sum of squared errors (SSE) can be written as

$$SSE = \sum_{i=1}^{N} \epsilon_i^2$$

and in matrix notation is

$$SSE = \epsilon'\epsilon \tag{8}$$

Substituting (7) into (8) yields

$$\epsilon'\epsilon = (Y - X\beta)'(Y - X\beta) \tag{9}$$

which we would like to minimize. If classical optimization is
performed on (9), an estimator of $\beta$,

$$b = (X'X)^{-1}X'Y \tag{10}$$

is found to minimize SSE.
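The estimator in (10) can be illustrated with a small numerical sketch. The code below is not one of the packages used in this study; it is a minimal illustration in Python (standard library only) that forms the normal equations and solves them by Gaussian elimination. The function names `solve`, `ols`, and `sse`, and the five data points, are assumptions made for the example.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(rows, y):
    """Return b = (X'X)^-1 X'Y, where X is `rows` with a leading column of ones."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def sse(rows, y, b):
    """Sum of squared errors between observed y and the fitted equation."""
    fitted = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


# Hypothetical data generated exactly from y = 1 + 2*x1 - 0.5*x2 (no error term).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
y = [1.0 + 2.0 * x1 - 0.5 * x2 for x1, x2 in rows]
b = ols(rows, y)
print([round(v, 6) for v in b], sse(rows, y, b))
```

Because the example data contain no error term, the recovered coefficients reproduce the generating equation and the SSE is zero to rounding. Production packages use more stable factorizations than the explicit normal equations shown here.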

The variation of the dependent variable measurements about
their mean,

$$\sum_{i=1}^{N} (y_i - \bar{y})^2 \tag{11}$$

is called the Total Sum of Squares (SST). SST can be decomposed into two
components, the Sum of Squares Explained by the Regression (SSR) and
the error unexplained by the regression (SSE), such that

$$SST = SSR + SSE \tag{12}$$

This partitioning gives rise to a measure of goodness of fit, $R^2_{yx}$,
commonly written as just $R^2$ and called the Coefficient of
Determination or the Multiple Correlation Coefficient Squared. $R^2$ is
defined as

$$R^2 = \frac{SSR}{SST} \tag{13}$$

and represents the fraction of the variability in the dependent
variable explained by the regression equation. But it is known that
the Multiple Correlation Coefficient calculated in this way is biased
upward, tending to indicate a higher degree of correlation than actually
exists in the true population. In order to correct for the bias, $R^2$ is
adjusted for degrees of freedom by the relationship

$$\bar{R}^2 = 1 - (1 - R^2)\left(\frac{N-1}{N-K-1}\right) \tag{14}$$

and called the Adjusted Multiple Correlation Coefficient. While $\bar{R}^2$
is not entirely unbiased, it does exhibit less bias than $R^2$, and will be
used in this analysis.
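Equations (13) and (14) translate directly into a few lines of code. The following Python sketch is illustrative only; the function names are assumptions. As a check, it applies (14) to the values reported for the Westinghouse equation in Table II (N = 63 data points, K = 21 terms, $R^2$ = .9283).

```python
def r_squared(y, y_hat):
    """R^2 = SSR/SST, computed as 1 - SSE/SST (valid for a least squares fit with a constant term)."""
    mean = sum(y) / len(y)
    sst = sum((v - mean) ** 2 for v in y)
    sse = sum((v - f) ** 2 for v, f in zip(y, y_hat))
    return 1.0 - sse / sst


def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 of equation (14): n observations, k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Values taken from Table II: R^2 = .9283 with N = 63 and K = 21.
print(round(adjusted_r_squared(0.9283, 63, 21), 4))  # 0.8916
```

The result agrees with the adjusted value of .8916 listed in Table II, which shows how much of the apparent fit of a 21-term equation on 63 points is attributable to the number of terms alone.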

A third measure of goodness of fit is the $C_p$ statistic derived by
Mallows, which is based on total squared error. The total squared
error can be considered to be made up of a squared bias plus a squared
random error in y at each data point. If the total squared error is
represented as

$$\sum_{i=1}^{N} (\nu_i - \eta_i)^2 + \sum_{i=1}^{N} \mathrm{Var}(\hat{y}_i) \tag{15}$$

where $\nu_i = \nu(x_{i1}, x_{i2}, \ldots, x_{iK})$ is the expected value of y
from an equation with true $\beta$s, $\eta_i = b_0 + \sum_{j=1}^{K} b_j x_{ij}$
is the expected value of $\hat{y}$ from the equation of estimates of $\beta$s,
and p is K + 1. The term $\sum_{i=1}^{N} (\nu_i - \eta_i)^2$ can be
represented by $SSEB_p$, called the sum of squared errors bias. Also,
$\Gamma_p$ can be defined as the standardized total squared error

$$\Gamma_p = \frac{SSEB_p}{\sigma^2} + \frac{1}{\sigma^2} \sum_{i=1}^{N} \mathrm{Var}(\hat{y}_i) \tag{16}$$

It is known [Daniel and Wood (Ref 8:86)] that

$$\sum_{i=1}^{N} \mathrm{Var}(\hat{y}_i) = p\sigma^2 \tag{17}$$

Combining (16) and (17) forms

$$\Gamma_p = \frac{SSEB_p}{\sigma^2} + p \tag{18}$$

The error sum of squares (SSE) is defined as

$$SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \tag{19}$$

and its expected value is

$$E(SSE) = \sum_{i=1}^{N} (\nu_i - E(\hat{y}_i))^2 + (N-p)\sigma^2 \tag{20}$$

Because $E(\hat{y}_i) = \eta_i$, then

$$E(SSE) = \sum_{i=1}^{N} (\nu_i - \eta_i)^2 + (N-p)\sigma^2 \tag{21}$$

$$E(SSE) = SSEB_p + (N-p)\sigma^2 \tag{22}$$

Combining (18) and (22) gives

$$\Gamma_p = \frac{E(SSE_p)}{\sigma^2} - (N - 2p) \tag{23}$$

Now define $C_p$ as an estimate of $\Gamma_p$:

$$C_p = \frac{SSE_p}{s^2} - (N - 2p) \tag{24}$$

where $s^2$ is an estimate of $\sigma^2$. Note from (22) that when the correct
model is used the bias, $SSEB_p$, goes to zero and $C_p$ goes to p.

Then the objective of the $C_p$ search is to find the p-term
equation which has a $C_p/p$ value nearest one and therefore minimum bias.
A drawback to the method is that it is sensitive to the $s^2$ estimate of
variance. The $C_p$ values obtained from two different models may not be
comparable unless the same value of $s^2$ was used in both. For this
reason, the LLSCFP allows the option of using $s^2$ from an entire set
of input variables, from a subset of variables, or a user supplied
value. It seems sensible, when a large number of models are being
compared, to supply a constant value of $s^2$ so the $C_p$ values can be
compared.

It should be noted that the form of the $\Gamma_p$ and $C_p$ equations (23)
and (24) indicates the importance of eliminating unnecessary variables
from a model. Because the term $(N - 2p)$ changes by two when p changes
by one, the removal of one variable has the capability to
remove as much as two units from the standardized squared error.
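Equation (24) and the $C_p/p$ criterion can be sketched as follows. This is an illustration in Python, not the LLSCFP implementation; the function name and all SSE and $s^2$ values are hypothetical and chosen only to show the arithmetic.

```python
def mallows_cp(sse_p, s2, n, p):
    """C_p = SSE_p / s^2 - (N - 2p), as in equation (24)."""
    return sse_p / s2 - (n - 2 * p)


n = 63        # number of data points, as in this study
s2 = 0.50     # hypothetical variance estimate, held constant across models
# Hypothetical SSE values for candidate equations with p terms (including the constant).
candidates = {3: 45.0, 5: 30.5, 8: 27.9}

for p, sse_p in candidates.items():
    cp = mallows_cp(sse_p, s2, n, p)
    print(p, round(cp, 2), round(cp / p, 2))

# When s^2 is estimated from the same p-term equation, s^2 = SSE_p / (N - p),
# and C_p equals p identically, so it carries no information about bias.
print(mallows_cp(27.9, 27.9 / (n - 8), n, 8))
```

The last line shows why a common $s^2$ is needed when comparing models: if each candidate supplies its own variance estimate, every $C_p$ collapses to p.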
II. Selection of Variables

Variables

Westinghouse collected data amounting to 63 points on 21 physical
and usage characteristics and 6 logistics characteristics of line
replaceable avionics units, with the intent to predict logistics
characteristics from physical and usage characteristics. Sources used to
collect the data were: existing Air Force Data Systems, site visits to
Air Force logistics facilities, Westinghouse activities, published
reports, and engineering analysis of LRUs. The following paragraphs
describe briefly each of the variables. For a more in-depth
description of the variables and the manner in which they were collected, see
the Westinghouse report (Ref 22:18).

The first six independent variables are measures of physical
characteristics. The Unit Price is measured in dollars per LRU.
The Volume is measured in cubic feet. Weight is measured in pounds.
Component Count is a measure of the number of electrical components
of an LRU and does not include mechanical devices, connectors, or
structure. Component Density is simply Component Count divided by
Volume. While Westinghouse used Component Density in their
analysis, it will not be used here, because the matrix of data would be
singular in the log-linear model which will eventually be used: the
logarithm of Component Density is exactly the logarithm of Component
Count minus the logarithm of Volume. Power
Dissipation is measured in input watts minus output watts.


The next five independent variables, measures of component
type, are in terms of percentages and are additive to unity. The
variable names are descriptive, but if information on the way the
measurements are determined is desired, the Westinghouse report (Ref 22:18)
should be consulted. The variables are: Fraction Digital, Fraction
Analog, Fraction Electromechanical, Fraction Power Supply, and
Fraction Transmitter.

The twelfth variable, Fraction Solid State, is a measure of LRU
technology, the percentage of components in the LRU that are solid
state in nature.

The next data element collected concerned aircraft type and
usage. The three types of aircraft were Fighter, Bomber, and Cargo.
The three types of usage were Navigation, Sensory, and
Communications. In both the Westinghouse study and this study these
parameters were used as indicators, but in different ways. Westinghouse
coded them as follows:

Bomber            1  0
Cargo             0  1
Fighter           0  0

Sensory           1  0
Communications    0  1
Navigation        0  0

In addition, Westinghouse found interactions between certain of the two
types useful in their analysis. These were coded as follows:

Bomber Sensory            1  0  0
Bomber Communications     0  1  0
Cargo Communications      0  0  1

In this study a different approach was taken to these variables.
Interactions between all of the aircraft types and usages were considered as
indicator variables and coded as follows:

Nav Fighter       1  0  0  0  0  0  0
Nav Bomber        0  1  0  0  0  0  0
Nav Cargo         0  0  1  0  0  0  0
Sensor Fighter    0  0  0  1  0  0  0
Sensor Bomber     0  0  0  0  1  0  0
Comm Fighter      0  0  0  0  0  1  0
Comm Bomber       0  0  0  0  0  0  1
Comm Cargo        0  0  0  0  0  0  0

The data base does not include any sensory equipment on Cargo
aircraft, thus it is not used as a variable.
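The coding scheme above can be expressed as a small lookup. The following Python sketch is illustrative; the names `INDICATOR_ORDER`, `BASELINE`, and `encode` are assumptions, not identifiers from the programs written for this study. Communications on Cargo aircraft is the reference category (all zeros), and Sensory on Cargo is rejected because the data base contains no such equipment.

```python
# Column order follows the coding table: one indicator per usage/aircraft pair.
INDICATOR_ORDER = [
    ("Nav", "Fighter"), ("Nav", "Bomber"), ("Nav", "Cargo"),
    ("Sensor", "Fighter"), ("Sensor", "Bomber"),
    ("Comm", "Fighter"), ("Comm", "Bomber"),
]
BASELINE = ("Comm", "Cargo")  # coded as all zeros


def encode(usage, aircraft):
    """Return the seven 0/1 indicator values for one LRU."""
    key = (usage, aircraft)
    if key == BASELINE:
        return [0] * len(INDICATOR_ORDER)
    if key not in INDICATOR_ORDER:
        raise ValueError("combination not present in the data base: %s %s" % key)
    return [1 if key == k else 0 for k in INDICATOR_ORDER]


print(encode("Sensor", "Bomber"))  # [0, 0, 0, 0, 1, 0, 0]
print(encode("Comm", "Cargo"))     # [0, 0, 0, 0, 0, 0, 0]
```

Leaving one category as the all-zero baseline keeps the indicator columns from summing to the constant column, which would otherwise make the data matrix singular.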

The  final  independent  variable  identified  by  Westinghouse  is  the 
Percentage  of  Failures  Detected  by  Built-In-Test  (BIT).  This  variable 
is  intended  to  be  a measure  of  the  effectiveness  of  the  Built-In-Test/ 
Fault-Isolation-Test  (BIT/FIT)  capabilities  of  each  LRU. 

The six dependent variables which Westinghouse identified are
Maintenance Manhours/Operating Hour, Mean Time Between Failures,
Mean Time Between Maintenance Actions, Logistics Support Cost/
Operating Hour, Training Costs/Operating Hour, and Percentage Not
Repairable This Station. The Operating Hours used to normalize is an
estimate of the amount of time an LRU is turned on, data which is not
recorded. Instead, Westinghouse used flying hours/year of the
aircraft using an LRU multiplied by a scaling factor which they estimated
for each of the three types of aircraft. All of the maintenance and
operating hours and cost data for these six variables are total yearly
figures.

The previous analysis regressed the independent variables
against each of the dependent variables individually. Because it is not
the purpose of this study to dispute the earlier one, but rather to
develop a method for the personnel of the Avionics Laboratory to
perform the regression, only one of the dependent variables was chosen
for illustration. Logistics Support Cost/Operating Hour was chosen
because the Westinghouse report recommended that further study be
conducted on it.

Table  I contains  the  variables  and  the  abbreviations  used  in  this 
and  the  previous  report.  An  asterisk  indicates  that  the  variable  was 
not  used  in  the  study  indicated.  Table  II  contains  the  equation  found 
by  Westinghouse. 

Models 

It  is  well  known  that  the  selection  of  the  model  is  very  important 
to  the  goodness  of  fit  of  a regression  and  its  predictive  ability.  In 
light  of  this,  a balance  was  sought  that  would  yield  a model  with 
elements  of  both  simplicity  and  goodness  of  fit. 

The  first  model  considered  was  the  simplest  form  of  multiple 
regression  equation,  namely 


Table I (cont'd)

                                                 Abbreviations
Name                                         Westinghouse   This Report

Sensory-Fighter                              *              SF
Sensory-Bomber                               BOMSEN         SB
Communications-Fighter                       *              CF
Communications-Bomber                        BOMCOM         CB
Communications-Cargo                         CARCOM         COMMC
Fraction BIT/FIT                             BIT/FIT        BF
Logistics Support Cost/Operating Hour        LSC/OH         LSC/OH
Maintenance Manhours/Operating Hour          MMH/OH         *
Mean Time Between Failures                   MTBF           *
Mean Time Between Maintenance Actions        MTBMA          *
Training Cost/Operating Hour                 TRAIN/OH       *
Not Repairable This Station                  NRTS           *

* Not used in the analysis.

Table II

Logistics Support Cost/Operating Hour Equation
Generated by Westinghouse

$$\ln(LSC/OH) = a_0 + \sum_{i=1}^{21} b_i X_i$$

$\bar{R}^2$ = .8916    $R^2$ = .9283    F-value = 25.3

 i    b_i                   X_i                               Partial F

 0    -8.15108
 1     3.86111              (IBOM - .286)                     36.0
 2     3.66533              (ICAR - .270)                     31.4
 3    -4.85271 x 10^-1      (ISEN - .254)                      3.6
 4    -2.56663              (IBOM - .286)(ISEN - .254)        37.2
 5    -1.66262              (IBOM - .286)(ICOM - .206)        12.2
 6    -7.67253 x 10^-1      (ICAR - .270)(ICOM - .206)         3.2
 7     1.27356 x 10^-2      FPS                                6.8
 8     2.25967 x 10^-2      (FAN - 63.3)                      36.0
 9    -7.42999 x 10^-3      (FSS - 61.1)                       9.0
10     2.38503              (UP - 1.64)                       27.0
11    -9.20384 x 10^-11     (UP - 133606.3)^2                 25.0
12    -1.52864 x 10^-4      (W - 64.3)^2                       8.4
13    -1.07105 x 10^-3      (FAN - 48.9)^2                    33.6
14     1.20418 x 10^-3      (FEM - 47.0)^2                    33.6
15     7.10025 x 10^-4      (FXR - 40.2)^2                    10.9
16    -1.61651 x 10^-4      (FSS - 51.9)^2                     2.2
17    -1.11568 x 10^-6      (PD - 722)^2                       7.3
18     5.00996              (UP - 1.68)^2                     42.2
19     1.70042 x 10^-3      (BF - 27.3)^2                     13.0
20     4.60293 x 10^-1      LN(UP)                            31.4
21     2.35583 x 10^-1      LN(V)                              4.8

$$y = \alpha_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + \epsilon \tag{1}$$

When this model was used to form a prediction equation on the original
data using SPSS, it did not provide satisfactory goodness of fit, having
a Correlation Coefficient Squared of approximately 0.45, indicating
that the values predicted by the model were poorly matched to the
actual values. The model fell far short of the $R^2$ of approximately
0.93 achieved by Westinghouse.

A model was desired which would allow more possibility of
interactions and not be restricted to linearity of variables. At the
same time it was noted that the five measures of component type,
namely %DIG, %AN, %EM, %PS, and %XMTR, along with the Fraction of
Malfunctions Detected by BIT/FIT, BF, could all be converted to
indicator variables. [...]
consisted of 13 indicator or dummy variables and 6 ordinary
independent variables.


The model settled on was one of considerable complexity but
offered many possibilities for interactions between variables. The
Product of Powers model is of the form

$$y = \exp\left(\alpha_0 + \sum_{i=1}^{13} \alpha_i D_i\right) \prod_{j=1}^{6} x_j^{\left(\beta_{j0} + \sum_{i=1}^{13} \beta_{ji} D_i\right)} \tag{25}$$

where $\alpha$, $\beta$, and $x$ have the same meaning as described in the
previous model, $D_i$ is indicator or dummy variable i, and j is the index of
ordinary variable x. Because this model is not linear, least squares
regression cannot be used on it without a transformation. If the
natural logarithm is taken on both sides of (25), the resulting equation
is

$$\ln y = \alpha_0 + \sum_{i=1}^{13} \alpha_i D_i + \sum_{j=1}^{6} \beta_{j0} \ln x_j + \sum_{j=1}^{6} \sum_{i=1}^{13} \beta_{ji} D_i \ln x_j \tag{26}$$


Linear least squares regression can be applied to the model in (26),
but there are now 97 possible variables: 13 indicator terms, 6
logarithmic terms, and 78 interaction terms. Because there are only 63
data points, selection of variables has become of paramount importance.
And because only one of the packages used in this study, SPSS, is
capable of handling this number of terms, some method was needed to
eliminate terms prior to regression. That is where a test of equality
of regression populations was useful.
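The count of 97 candidate terms in (26) can be verified by constructing one row of the transformed data matrix. The Python sketch below is illustrative only; `design_row` and the sample values are assumptions. Each row holds the 13 indicators $D_i$, the 6 logarithms $\ln x_j$, and the 78 products $D_i \ln x_j$; the constant term is left to the regression routine.

```python
import math


def design_row(dummies, xs):
    """Build the regressors of equation (26) for one observation.

    dummies: the 13 indicator values D_i (0 or 1)
    xs:      the 6 ordinary variables x_j (must be positive)
    """
    logs = [math.log(x) for x in xs]
    row = list(dummies) + logs
    for lx in logs:                 # j = 1..6
        for d in dummies:           # i = 1..13
            row.append(d * lx)      # interaction term D_i * ln x_j
    return row


# Hypothetical observation: only the first indicator is on.
dummies = [1] + [0] * 12
xs = [1500.0, 0.8, 22.0, 640.0, 75.0, 120.0]
row = design_row(dummies, xs)
print(len(row))  # 97
```

With 63 observations and 97 columns the normal equations cannot have a unique solution, which is the practical reason a pre-screening step is required before any of the packages can be run.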


Test of Equality of Regression Populations

The coefficients and constant terms, $\beta$s and $\alpha$s in (26), could be
written in vector form as

$$\beta = \begin{bmatrix} \alpha_0 + \alpha_1 D_1 + \cdots + \alpha_{13} D_{13} \\ \beta_{10} + \beta_{11} D_1 + \cdots + \beta_{1,13} D_{13} \\ \vdots \\ \beta_{60} + \beta_{61} D_1 + \cdots + \beta_{6,13} D_{13} \end{bmatrix} \tag{27}$$

by noting that $\ln x_j$ is a common factor in the two sums over j in
(26) and that $D_i$ = 0 or 1. If only one indicator is considered at a time
and all others are considered constant, (27) becomes

$$\beta_b = \begin{bmatrix} \alpha_0 + \alpha_i \\ \beta_{10} + \beta_{1i} \\ \beta_{20} + \beta_{2i} \\ \vdots \\ \beta_{60} + \beta_{6i} \end{bmatrix} \tag{28}$$

Equation (28) can be divided into two subsets depending on whether
$D_i$ = 0 or 1:

$$\beta_a = \begin{bmatrix} \alpha_0 \\ \beta_{10} \\ \beta_{20} \\ \vdots \\ \beta_{60} \end{bmatrix} \tag{29}$$

if $D_i$ = 0. If $D_i$ = 1, then equation (28) is the case.


If it could be shown that equations (28) and (29) are not
significantly different, then (29) would be used and the interaction terms
between that $D_i$ and all $x_j$, as well as that $\alpha_i D_i$, would not be
included in the model. Further, if it could be shown that
$\beta_a = \beta_b$ where

$$\beta_a = \begin{bmatrix} \alpha_0 + \alpha_i \\ \beta_{10} + \beta_{1i} \\ \beta_{20} \\ \beta_{30} + \beta_{3i} \\ \vdots \\ \beta_{60} + \beta_{6i} \end{bmatrix} \tag{30}$$

and $\beta_b$ = equation (28), then the individual interaction between $x_2$ and
that $D_i$ would not be needed in the model. Any of the $\alpha$ or $\beta$ terms
could be so tested either singly as above or in combinations.

Chow describes such a test (Ref 7:599) and calls it a Test of
Equality Between Subsets of Coefficients in Two Regressions. The
same technique was described by Fisher (Ref 12:364).

Under the alternative hypothesis $\beta_a \neq \beta_b$ the model becomes

$$y_1 = X_1 \beta_1 + \epsilon_1 = Z_1 \gamma_1 + W_1 \delta_1 + \epsilon_1 \tag{31}$$

$$y_2 = X_2 \beta_2 + \epsilon_2 = Z_2 \gamma_2 + W_2 \delta_2 + \epsilon_2 \tag{32}$$

or, in stacked form,

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} Z_1 & 0 & W_1 & 0 \\ 0 & Z_2 & 0 & W_2 \end{bmatrix} \begin{bmatrix} \gamma_1 \\ \gamma_2 \\ \delta_1 \\ \delta_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \end{bmatrix} \tag{33}$$

where $Z_1$, $Z_2$, $W_1$, and $W_2$ are submatrices of the X matrix of data

$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1K} \\ 1 & x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & & & & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{NK} \end{bmatrix} \tag{34}$$

The matrix $Z_1$ contains those elements of X for the variables which
are being tested and in which the $D_i$ being considered is equal to zero.
$Z_2$ contains those elements of X for the variables which are being
tested and in which the $D_i$ being considered is equal to one. $W_1$
contains those elements of X for the variables which are not being tested
and in which the $D_i$ equals zero. $W_2$ contains the remainder of X,
those elements for which the variable is not being tested and $D_i$ equals
one. The vector $\gamma_1$ contains the regression coefficients which are
being tested, assuming $D_i$ equals zero, such as $\beta_{20}$ in (30). The
vector $\gamma_2$ contains the coefficients being tested assuming $D_i$ equals one,
such as $\beta_{20} + \beta_{2i}$. The vector $\delta_1$ contains the regression
coefficients not being tested assuming $D_i$ equals zero. The vector $\delta_2$
contains the coefficients not being tested assuming $D_i$ equals one. For example, if
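it is desired to test whether

$$\begin{bmatrix} \alpha_0 + \alpha_i \\ \beta_{10} + \beta_{1i} \\ \beta_{20} \\ \beta_{30} + \beta_{3i} \\ \beta_{40} + \beta_{4i} \\ \beta_{50} + \beta_{5i} \\ \beta_{60} + \beta_{6i} \end{bmatrix} = \begin{bmatrix} \alpha_0 + \alpha_i \\ \beta_{10} + \beta_{1i} \\ \beta_{20} + \beta_{2i} \\ \beta_{30} + \beta_{3i} \\ \beta_{40} + \beta_{4i} \\ \beta_{50} + \beta_{5i} \\ \beta_{60} + \beta_{6i} \end{bmatrix}$$

then the submatrix of observations on the tested variable for which
$D_i$ equals one is

$$Z_2 = \begin{bmatrix} x_{n+1,2} \\ x_{n+2,2} \\ \vdots \\ x_{n+m,2} \end{bmatrix}$$

assuming that $D_i$ equals zero for the first n observations and one for
the next m observations, and the vector of coefficients not being tested is

$$\delta_2 = \begin{bmatrix} \alpha_0 + \alpha_i \\ \beta_{10} + \beta_{1i} \\ \beta_{30} + \beta_{3i} \\ \beta_{40} + \beta_{4i} \\ \beta_{50} + \beta_{5i} \\ \beta_{60} + \beta_{6i} \end{bmatrix} \tag{42}$$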


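Under the null hypothesis $\beta_a = \beta_b$, equation (34), the model
becomes

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} Z_1 & W_1 & 0 \\ Z_2 & 0 & W_2 \end{bmatrix} \begin{bmatrix} \gamma_0 \\ \delta_1 \\ \delta_2 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \end{bmatrix} \tag{43}$$

where $Z_1$, $Z_2$, $W_1$, $W_2$, $\delta_1$, and $\delta_2$ are the same as in
(33) and $\gamma_0$ contains the regression coefficients of the variables being
tested assuming $D_i$ equals zero. In the example being used here,

$$\gamma_0 = \begin{bmatrix} \beta_{20} \end{bmatrix} \tag{44}$$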

Under the null hypothesis, the least squares estimators of $\gamma_0$,
$\delta_1$, and $\delta_2$ are

$$\begin{bmatrix} c_0 \\ d_{10} \\ d_{20} \end{bmatrix} = \begin{bmatrix} Z_1'Z_1 + Z_2'Z_2 & Z_1'W_1 & Z_2'W_2 \\ W_1'Z_1 & W_1'W_1 & 0 \\ W_2'Z_2 & 0 & W_2'W_2 \end{bmatrix}^{-1} \begin{bmatrix} Z_1' & Z_2' \\ W_1' & 0 \\ 0 & W_2' \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \tag{45}$$

where $y_1$ is a vector of observed y values for which the corresponding
$D_i$ equals zero, and $y_2$ is a vector of observed y values for which the
corresponding $D_i$ equals one. Continuing the same example as
previously.


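Under the alternative hypothesis,

$$\begin{bmatrix} c_1 \\ c_2 \\ d_1 \\ d_2 \end{bmatrix} = \begin{bmatrix} Z_1'Z_1 & 0 & Z_1'W_1 & 0 \\ 0 & Z_2'Z_2 & 0 & Z_2'W_2 \\ W_1'Z_1 & 0 & W_1'W_1 & 0 \\ 0 & W_2'Z_2 & 0 & W_2'W_2 \end{bmatrix}^{-1} \begin{bmatrix} Z_1' & 0 \\ 0 & Z_2' \\ W_1' & 0 \\ 0 & W_2' \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \tag{48}$$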

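Once the estimated coefficients are found, an F-test can be
used to test the null hypothesis. If m > p, where p is the length of the
vectors in (34), the test is

$$F(q,\ m+n-2p) = \frac{\lVert Z_1c_1 + W_1d_1 - Z_1c_0 - W_1d_{10} \rVert^2 + \lVert Z_2c_2 + W_2d_2 - Z_2c_0 - W_2d_{20} \rVert^2}{\lVert y_1 - Z_1c_1 - W_1d_1 \rVert^2 + \lVert y_2 - Z_2c_2 - W_2d_2 \rVert^2} \cdot \frac{m+n-2p}{q} \tag{49}$$

If $p - q < m \le p$, where q is the number of variables being tested or the
length of vector $\gamma$ in (39), the test is

$$F(m-p+q,\ n-p) = \frac{\lVert Z_1c_1 + W_1d_1 - Z_1c_0 - W_1d_{10} \rVert^2 + \lVert y_2 - Z_2c_0 - W_2d_{20} \rVert^2}{\lVert y_1 - Z_1c_1 - W_1d_1 \rVert^2} \cdot \frac{n-p}{m-p+q} \tag{50}$$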


If the calculated F value is greater than the table F value at the
desired level of confidence, then reject $H_0$ and include the interaction
term in the model. If the calculated F is lower than the table F,
accept $H_0$ and do not include the interaction term in the model. If
m < p - q, the test cannot be performed, in which case the null hypothesis
has not been rejected and the variables are not included in the model.
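The F statistic for the case m > p can be sketched numerically. The following Python example is not the CHOW program written for this study; it is a reduced illustration, using the standard library only, for the simplest case in which each subset has a constant and one variable (p = 2) and only the slope is tested (q = 1). The function names and the simulated data are assumptions. The numerator is the squared distance between the fitted values under the alternative and null hypotheses, and the denominator is the error sum of squares under the alternative.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fitted(X, y):
    """Least squares fitted values X (X'X)^-1 X'y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(k)]
    coef = solve(xtx, xty)
    return [sum(c * v for c, v in zip(coef, r)) for r in X]


def chow_f(z1, y1, z2, y2):
    """F statistic for one tested slope (q = 1) with a separate constant in each subset (p = 2)."""
    n, m, p, q = len(y1), len(y2), 2, 1
    y = list(y1) + list(y2)
    # Null hypothesis: common slope on z, separate constants.
    x_null = [[z, 1.0, 0.0] for z in z1] + [[z, 0.0, 1.0] for z in z2]
    # Alternative hypothesis: separate slopes and separate constants.
    x_alt = [[z, 0.0, 1.0, 0.0] for z in z1] + [[0.0, z, 0.0, 1.0] for z in z2]
    f_null, f_alt = fitted(x_null, y), fitted(x_alt, y)
    num = sum((a - b) ** 2 for a, b in zip(f_alt, f_null)) / q
    den = sum((v - a) ** 2 for v, a in zip(y, f_alt)) / (m + n - 2 * p)
    return num / den, (q, m + n - 2 * p)


def simulate(delta, seed=0):
    """Hypothetical data: subset 2 has slope 2 + delta, subset 1 has slope 2."""
    rng = random.Random(seed)
    z1 = [rng.uniform(1.0, 2.0) for _ in range(30)]
    z2 = [rng.uniform(1.0, 2.0) for _ in range(30)]
    y1 = [1.0 + 2.0 * z + rng.gauss(0.0, 0.1) for z in z1]
    y2 = [1.0 + (2.0 + delta) * z + rng.gauss(0.0, 0.1) for z in z2]
    return z1, y1, z2, y2


f_same, df = chow_f(*simulate(0.0))
f_diff, _ = chow_f(*simulate(5.0))
print(df, f_same, f_diff)
```

When the two subsets truly share the tested coefficient the statistic stays small, and when they differ it becomes very large, which is the behavior the decision rule above relies on.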

In doing this study, the Chow test was performed on each
$\beta_{j0} + \beta_{ji}$ and each $\alpha_0 + \alpha_i$ term individually for each
combination of j and i. Thus the test had the potential to eliminate any
of 78 $\beta_{ji}$ and 13 $\alpha_i$ terms of the 97 possible in the product of
powers model used.

To perform the calculations for these tests three programs
were written. The first program, called YSEPR, simply separated the
observed values of y into the vectors $y_1$ and $y_2$ for each of the 13 dummy
variables. A listing of the program is contained in Figure 4 of the
appendix.

The second program, called SUMS, was written to calculate
all of the combinations of sums of variables, sums of squared
variables, and sums of cross products of variables grouped as in YSEPR
by the value of each dummy variable. These sums are calculated
because when the multiplications required in the large matrices in (45)
and (48) are performed, the results are some of those elements
calculated by SUMS. Rather than perform the calculations repeatedly
when only a small systematic change is needed for each test, they are
calculated only once by SUMS and read when needed by the third
program. A listing of SUMS is contained in Figure 5 of the appendix.
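The idea behind SUMS can be sketched as follows. This Python fragment is not the program listed in the appendix; it is a minimal illustration, with hypothetical function names and data, of accumulating the cross products once per subset so that blocks such as $Z_1'Z_1$, $Z_1'W_1$, and $W_1'W_1$ can later be read off by selecting rows and columns.

```python
def cross_products(rows, k):
    """Return the k-by-k matrix of sums of squares and cross products for `rows`."""
    return [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]


def sums_by_dummy(rows, dummy):
    """Accumulate cross products separately for observations with D = 0 and D = 1."""
    k = len(rows[0])
    group0 = [r for r, d in zip(rows, dummy) if d == 0]
    group1 = [r for r, d in zip(rows, dummy) if d == 1]
    return cross_products(group0, k), cross_products(group1, k)


# Hypothetical data: each row is (1, x1, x2); the leading 1 gives plain sums of variables.
rows = [(1.0, 2.0, 3.0), (1.0, 4.0, 1.0), (1.0, 5.0, 2.0), (1.0, 3.0, 6.0)]
dummy = [0, 1, 0, 1]
s0, s1 = sums_by_dummy(rows, dummy)

# If x1 is the tested variable, Z1'Z1 is s0[1][1] and Z2'Z2 is s1[1][1];
# their sum is the upper-left element of the matrix in (45).
print(s0[1][1] + s1[1][1])
```

Because the two subsets partition the observations, the subset matrices always add up to the cross products of the full data set, which is a convenient check on the stored sums.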

The third program, called CHOW, performs the calculations
in (34), (37), (38), and (39) and outputs the F values with the
parameters necessary to test the hypotheses. A listing of CHOW is
contained in Figure 6 of the appendix. Table III contains the
calculated F values generated by CHOW. Table IV contains the ranks of the
F values generated.

In the tests performed here, $p - q < m \le p$ never occurred.
Therefore, the degrees of freedom for all of the tests were q and m+n-2p, or
1 and 49, where m+n = 63 and p = 7.

The coefficients associated with dummy variables SB, CF, CB,
and PS could not be tested because m was less than p-q. As a result,
there were 91-24 = 67 variables tested using the Chow test. If the
overall hypothesis that none of the $\beta_{j0} + \beta_{ji}$ terms differ from
$\beta_{j0}$ and none of the $\alpha_0 + \alpha_i$ terms differ from $\alpha_0$ is
true, then the probability that at
least one of the individual hypotheses will be rejected would be less
than $\alpha^*$, where

$$\alpha^* = 1 - (1 - \alpha)^{67} \tag{51}$$

and $\alpha$ is the level at which individual tests are conducted. Then

$$\alpha = 1 - (1 - \alpha^*)^{1/67} \tag{52}$$

If $\alpha^*$ = 0.10 is desired then the associated value of $\alpha$ would be
0.00157. Using the tabled F-value for $\alpha$ = 0.001 assures that $\alpha^*$ is
less than 0.10. Lower values of $\alpha^*$ cannot be tested at this time
because available tables of F-values only go down to 0.001. When the
calculated F is larger than $F_{0.001}(1, 49) = 12.11$, the individual
hypothesis is rejected and the interaction $\beta_{ji}$ term cannot be removed
from the model on the basis of this test. Table III contains the
calculated F-values for each of the tested terms. Variable 0 indicates
the $\alpha_i D_i$ term. The numbering system used in program CHOW is shown
at the bottom of the table.
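The arithmetic in (51) and (52) is easily checked. The short Python sketch below is illustrative; the function names are assumptions. It reproduces the individual level of 0.00157 for an overall level of 0.10 with 67 tests, and shows the overall level actually attained when the tabled value of 0.001 is used.

```python
def overall_level(alpha, tests):
    """Equation (51): probability of at least one rejection among independent tests."""
    return 1.0 - (1.0 - alpha) ** tests


def individual_level(alpha_star, tests):
    """Equation (52): per-test level giving an overall level alpha_star."""
    return 1.0 - (1.0 - alpha_star) ** (1.0 / tests)


print(round(individual_level(0.10, 67), 5))   # 0.00157
print(round(overall_level(0.001, 67), 4))     # 0.0648
```

Testing each term at 0.001 therefore corresponds to an overall level of about 0.065, which is inside the desired 0.10.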

Only 13 terms had F-values lower than the critical F. These, in
addition to the 24 variables which could not be tested and therefore did
not fail a test of the null hypothesis, mean that 37 variables have been
eliminated from the 97 it started with, leaving 60 variables still in the
model. While this appears disappointing at first glance, leaving more
variables still in the model than LLSCFP or Leaps and Bounds can
Table III

Calculated F Values from CHOW Test

                                  Variable
Dummy     0*        1         2         3         4         5         6

  1     46.792    35.957    25.792    25.083    25.321     4.445    30.243
  2     48.9942   31.165    27.368    17.834    37.558     2.207    13.168
  3     47.188    35.020    20.261    42.983     0.434     1.664     2.142
  4     25.038     4.028    43.242    42.912    35.336    12.666     0.138
  5     NT        NT        NT        NT        NT        NT        NT
  6     NT        NT        NT        NT        NT        NT        NT
  7     NT        NT        NT        NT        NT        NT        NT
  8     49.002    36.238    36.429    47.934    44.038    48.998    35.319
  9     49.440    40.300    14.671    45.948     0.735     7.368     0.681
 10     48.900    35.818    43.052    21.842    28.715     7.510    19.331
 11     NT        NT        NT        NT        NT        NT        NT
 12     54.984    42.913    45.636    35.172    15.782    17.991    15.147
 13     46.648    39.791    22.918    39.312    11.510    15.345    11.583

Variable Names

  1     2     3     4     5      6
  UP    V     W     CC    %SS    PD

Dummy Names

  1    2    3    4    5    6    7    8     9    10   11   12     13
  NF   NB   NC   SF   SB   CF   CB   DIG   AN   EM   PS   XMTR   BF

NT indicates insufficient points in a subset to test.
* Variable 0 indicates the $\alpha_i D_i$ term.


handle, experimentation will be conducted in later chapters in which
the rankings of the F-values will be used to try to preselect a set of
variables for a model. The rankings of the F-values are shown in
Table IV, with the lowest F-value being ranked 1. Sub-optimized
solutions will also be tried, in which the 60 variables are broken into
subsets to search for smaller subsets that can be combined to form
manageable sets. Table V contains a list of the variables remaining in
the model after the Chow test.
Table IV

Rank of F-Values from CHOW Test

                        Variable
Dummy     0*     1      2      3      4      5      6
  1       55     39     29     27     28      9     32
  2       59     33     30     20     42      7     15
  3       56     34     23     48      2      5      6
  4       26      8     50     46     37     14      1
  5       NT     NT     NT     NT     NT     NT     NT
  6       NT     NT     NT     NT     NT     NT     NT
  7       NT     NT     NT     NT     NT     NT     NT
  8       61     40     41     57     51     60     36
  9       62     45     16     53      4     10      3
 10       58     38     49     24     31     11     22
 11       NT     NT     NT     NT     NT     NT     NT
 12       63     47     52     35     19     21     17
 13       54     44     25     43     12     18     13

NT indicates that there were insufficient points in one of the subsets
to perform the test.
*Variable 0 indicates the $\alpha_{j0} D_j$ term.
Table V

Variables Remaining in the Model After the Chow Test

 1  UP        21  NF*V       41  DIG*%SS
 2  V         22  NF*W       42  DIG*PD
 3  W         23  NF*CC      43  AN*UP
 4  CC        24  NF*PD      44  AN*V
 5  %SS       25  NB*UP      45  AN*W
 6  PD        26  NB*V       46  EM*UP
 7  NF        27  NB*W       47  EM*V
 8  NB        28  NB*CC      48  EM*W
 9  NC        29  NB*PD      49  EM*CC
10  SF        30  NC*UP      50  EM*PD
11  SB        31  NC*V       51  XMTR*UP
12  CF        32  NC*W       52  XMTR*V
13  CB        33  SF*V       53  XMTR*W
14  DIG       34  SF*W       54  XMTR*CC
15  AN        35  SF*CC      55  XMTR*%SS
16  EM        36  SF*%SS     56  XMTR*PD
17  PS        37  DIG*UP     57  BF*UP
18  XMTR      38  DIG*V      58  BF*V
19  BF        39  DIG*W      59  BF*W
20  NF*UP     40  DIG*CC     60  BF*%SS

61  LSC/OH (Dependent Variable)
Mechanics

The three most commonly used iterative techniques for determining
the proper variables in a regression are backward elimination,
forward selection, and stepwise regression.

All three of the above methods make use of the partial F-test.
In this test, the explained sum of squares (SSR) is decomposed into
components attributable to each independent variable.

In the standard regression method of decomposition, each variable
is treated as if it had been added to the regression equation in a
separate step after all other variables had been included. These
F-values are then used to determine the next variable to enter or leave
the equation, depending on the type of iterative regression being
performed. The F-value is given by

$$F = \frac{\text{SSR due to } x_i / 1}{\text{SSE}/(N-K-1)} = \frac{r^2_{y(i \cdot 1,2,\ldots,K)} / 1}{\left(1 - R^2_{y \cdot 1,2,\ldots,K}\right)/(N-K-1)} \qquad \text{(Ref 20:336)} \qquad (53)$$

The term $r^2_{y(i \cdot 1,2,\ldots,K)}$ is the squared part correlation, indicating the
relationship between the observed y and the residual of independent variable i
from which the effects of the other independent variables have been
removed.
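
As a rough illustration of the partial F-test in equation (53), the following Python sketch computes a partial F-value for every variable by refitting the regression without that variable and taking the increase in the error sum of squares as the SSR due to it. The helper names (`sse`, `partial_f`) and the simulated data are assumptions made for this example only; this is not how SPSS or BMD compute the statistic internally.

```python
import random

def sse(X, y, cols):
    """Error sum of squares for y regressed on an intercept plus the columns in cols."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    p = len(cols) + 1
    # Augmented normal equations [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        for k in range(p):
            if k != i:
                f = A[k][i] / A[i][i]
                A[k] = [a - f * b for a, b in zip(A[k], A[i])]
    b = [A[i][p] / A[i][i] for i in range(p)]
    return sum((t - sum(c * v for c, v in zip(b, r))) ** 2 for r, t in zip(rows, y))

def partial_f(X, y, cols):
    """Partial F of each variable, treated as if it were entered last."""
    n, k = len(y), len(cols)
    full = sse(X, y, cols)
    result = {}
    for c in cols:
        reduced = sse(X, y, [d for d in cols if d != c])
        # (SSR due to x_c / 1) / (SSE / (N - K - 1))
        result[c] = (reduced - full) / (full / (n - k - 1))
    return result

# Simulated data (hypothetical): y depends strongly on column 0, weakly on 1, not on 2.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(40)]
y = [2.0 * r[0] + 0.5 * r[1] + random.gauss(0, 1) for r in X]
F = partial_f(X, y, [0, 1, 2])
```

Each value in `F` would be compared with the tabled F at 1 and (N-K-1) degrees of freedom.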

If the hierarchical decomposition method is used instead of the
standard regression method, the order of inclusion must be specified,
and it, rather than a partial F-test, is used to determine the order in
which variables enter. The variable included first, the one with the
highest assigned inclusion level, is evaluated by the ratio

$$F = \frac{r^2_{y1} / 1}{\left(1 - R^2_{y \cdot 1,2,\ldots,K}\right)/(N-K-1)} \qquad (54)$$

The second regression coefficient is tested by the ratio

$$F = \frac{r^2_{y(2 \cdot 1)} / 1}{\left(1 - R^2_{y \cdot 1,2,\ldots,K}\right)/(N-K-1)} = \frac{\text{incremental SS due to } x_2 / 1}{\text{SSE}/(N-K-1)} \qquad \text{(Ref 20:337)} \qquad (55)$$

Each successive variable to be included in a hierarchical fashion
would be evaluated in the same manner as indicated above, with a
squared part correlation of the form $r^2_{y(i \cdot 1,2,\ldots,i-1)}$, where i is
the variable being added, in the numerator. All of the ratios above
should be compared to the tabled F distribution with 1 and (N-K-1)
degrees of freedom.
The hierarchical method of regression is used when there is
some basis, before the regression is accomplished, for believing that
some variables will explain more variance than others. Because
there were no prior feelings about the variables used in this analysis,
the hierarchical method was not used.

In the backward elimination method, a regression equation with
all of the possible terms is the starting point. Then a partial F-value
is calculated for each variable. If the lowest partial F-value is less than
some preselected F-value, the variable corresponding to that value is
removed from consideration, and the procedure is repeated, beginning
with the calculation of F-values for the new equation. If at any iteration
there are no variables that have an F-value low enough to cause removal,
the equation is adopted as calculated.
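
The backward elimination loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the simulated data, and the threshold of 4.0 are hypothetical and are not the SPSS or BMD implementation.

```python
import random

def sse(X, y, cols):
    """Error sum of squares for y regressed on an intercept plus the columns in cols."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    p = len(cols) + 1
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        for k in range(p):
            if k != i:
                f = A[k][i] / A[i][i]
                A[k] = [a - f * b for a, b in zip(A[k], A[i])]
    b = [A[i][p] / A[i][i] for i in range(p)]
    return sum((t - sum(c * v for c, v in zip(b, r))) ** 2 for r, t in zip(rows, y))

def backward_eliminate(X, y, cols, f_out):
    """Start from all variables; repeatedly drop the one with the lowest partial F."""
    cols = list(cols)
    n = len(y)
    while cols:
        full = sse(X, y, cols)
        df = n - len(cols) - 1
        f = {c: (sse(X, y, [d for d in cols if d != c]) - full) / (full / df)
             for c in cols}
        worst = min(f, key=f.get)
        if f[worst] >= f_out:   # nothing is low enough to remove: adopt the equation
            break
        cols.remove(worst)
    return cols

# Simulated data (hypothetical): only column 0 truly matters.
random.seed(1)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(50)]
y = [3.0 * r[0] + random.gauss(0, 1) for r in X]
kept = backward_eliminate(X, y, [0, 1, 2, 3], f_out=4.0)
```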

The  forward  selection  method  operates  by  entering  variables  one 
at  a time  until  a satisfactory  equation  is  reached.  The  order  of 
inclusion  is  determined  by  using  partial  correlation  coefficients  as  a 
measure  of  relative  importance  of  the  variables  not  yet  in  the  equation. 
At  each  step,  the  variable  with  the  highest  partial  correlation  coef- 
ficient, that  is,  the  variable  with  the  highest  correlation  to  the  depend- 
ent variable  after  allowing  for  the  effect  of  previously  included  vari- 
ables, is  brought  into  the  equation.  This  procedure  is  repeated  until 
the  partial  F-value  of  the  latest  entered  term  is  less  than  some  prese- 
lected value;  then  the  equation  is  adopted. 



The  stepwise  regression  method  is  a refinement  of  the  forward 
selection  method  in  that  at  every  stage  of  the  procedure,  the  variables 
included  in  the  model  in  earlier  stages  are  examined.  Thus  a variable 
which  was  entered  at  an  earlier  stage  but  has  been  rendered  unimpor- 
tant by  its  relationship  to  later  entered  variables  will  be  detected  and 
removed  from  the  equation.  Again,  this  is  done  by  comparing  the 
partial  F-values  to  a preselected  F-value  to  find  any  variables  to  be 
removed.  In  later  iterations,  the  removed  variable  is  treated  the 
same  as  a variable  that  has  never  entered  the  equation.  The  criterion 
for  inclusion  into  the  stepwise  model  is  the  same  as  in  the  forward 
selection  method. 
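
A compact Python sketch of the stepwise procedure just described follows. It enters the candidate with the largest partial F (which selects the same variable as the largest partial correlation) and then re-examines the variables already in the model. All names, the simulated data, and the FIN/FOUT values of 4.0 and 3.9 are assumptions for illustration; the cap on iterations is a safety guard added here, not part of the method.

```python
import random

def sse(X, y, cols):
    """Error sum of squares for y regressed on an intercept plus the columns in cols."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    p = len(cols) + 1
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        for k in range(p):
            if k != i:
                f = A[k][i] / A[i][i]
                A[k] = [a - f * b for a, b in zip(A[k], A[i])]
    b = [A[i][p] / A[i][i] for i in range(p)]
    return sum((t - sum(c * v for c, v in zip(b, r))) ** 2 for r, t in zip(rows, y))

def stepwise(X, y, candidates, f_in, f_out, max_steps=100):
    """Enter the best candidate, then re-examine earlier entries for removal."""
    n, chosen = len(y), []
    for _ in range(max_steps):
        changed = False
        base = sse(X, y, chosen)
        # Entry step: candidate with the largest partial F, if it exceeds f_in.
        best, best_f = None, f_in
        for c in candidates:
            if c not in chosen:
                new = sse(X, y, chosen + [c])
                f = (base - new) / (new / (n - len(chosen) - 2))
                if f > best_f:
                    best, best_f = c, f
        if best is not None:
            chosen.append(best)
            changed = True
        # Removal step: drop the weakest variable if its partial F fell below f_out.
        if len(chosen) > 1:
            full = sse(X, y, chosen)
            df = n - len(chosen) - 1
            f = {c: (sse(X, y, [d for d in chosen if d != c]) - full) / (full / df)
                 for c in chosen}
            worst = min(f, key=f.get)
            if f[worst] < f_out:
                chosen.remove(worst)
                changed = True
        if not changed:
            break
    return chosen

# Simulated data (hypothetical): columns 0 and 1 carry the signal.
random.seed(2)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(60)]
y = [3.0 * r[0] - 2.0 * r[1] + random.gauss(0, 1) for r in X]
chosen = stepwise(X, y, [0, 1, 2, 3], f_in=4.0, f_out=3.9)
```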

While  calculation  of  the  partial  F-values  needed  to  use  these 
regression  methods  would  be  too  time  consuming  to  perform  manually, 
both  of  the  regression  packages  used  in  this  chapter,  SPSS  and  BMD, 
provide  them,  making  the  two  packages  easy  and  convenient  to  use. 
Because  stepwise  regression  incorporates  the  main  features  of  back- 
ward elimination  and  forward  selection,  it  is  considered  to  be  the  best 
of  the  three  methods  (Ref  11:172).  While  it  has  been  shown  that  the 
three  methods  do  not  always  choose  the  same  subset  (Ref  17:9),  the 
stepwise  method  is  more  likely  to  choose  the  best.  For  that  reason, 
of  the  three  methods,  stepwise  regression  will  be  used  here. 

In using iterative regression methods, two pitfalls should be kept
in mind. First, no significance should be attached to the order in which
variables are entered into or removed from the model. The first variable
entered is not necessarily the most important one once other variables
have been entered. The partial F-value of each variable must be
considered after each step to determine relative significance.
Secondly, there is no guarantee that any of the three methods
will supply an optimal equation, because of the restriction of removing
and entering one variable at a time.

It was previously stated that preselected values of the F distribution
are used as criteria in including and deleting variables from
the equation. In both the SPSS and BMD packages these values are
called FIN and FOUT, respectively. A problem arises in preselecting
these values, as the degrees of freedom are not known exactly before
a regression is run. The tabled F-value has degrees of freedom 1 and
(N-K-1), where N is the number of data points and K is the number of
independent variables in the equation. As the primary goal of an
analyst is normally to maximize some measure of goodness-of-fit, such
as R, while a secondary goal is to minimize K, K is not known exactly
before the regression. Therefore, the values of FIN and FOUT must
be determined from an estimated K. As a result, some experimentation
may be necessary to obtain the value of K which yields desired
values for the goodness-of-fit measure.

An alternative is to use the default F-values for FIN and FOUT,
0.01 and 0.005, respectively. These values are designed to include
all variables and not remove any from the equation; in effect, a forced
fit unless there are a large number of variables. In the course of the
analysis, regressions were performed on identical data using varying
values for FIN and FOUT. When values other than default were used,
truncated versions of the same equation obtained from the default
values were identified. It appears the simplest procedure may be to
use default values of FIN and FOUT until the value of K is known, then
use the table value of F at degrees of freedom 1 and (N-K-1) as FIN.
FOUT should be slightly smaller than FIN, but other than that its
selection is rather arbitrary.

The Packages

As would be expected when using packages that operate in the
same way, SPSS and BMD yield the same answers and even employ the
same format of output. At each step, each prints the value of the
multiple correlation coefficient, R, the standard error of estimate, an
analysis of variance table, the coefficients of all variables currently in
the equation, the standard errors of the coefficients, and the partial
F-values of all independent variables whether or not in the equation. In
addition, SPSS provides R² and adjusted R². Both also list the value of
calculated F to test the hypothesis
Plots of residuals versus the sequence of cases in a file can be
selected as options in both packages. Additionally, BMD will make
plots of residuals versus all or selected independent variables as an
option. Other options for both input and output are available and are
identified in the appropriate manuals (Ref 20:352) and (Ref 10:235). Examples
of output are available in both manuals (Ref 20:360) and (Ref 10:249), and
partial output prepared for this report is included in the appendix in
Figures 7 and 8.

SPSS is capable of handling up to 100 independent variables per
fit, and BMD can handle up to 80. While both of the packages have the
capability to compute transformations, the feature was not used.
Instead, all interactions were computed using a short FORTRAN program
and stored with the original variables on a permanent storage
file to avoid recalculating them numerous times. Table VI indicates
the location of all variables on the tape used for both stepwise
regression and Leaps and Bounds.
Table VI

Numbers of the Variables on Permanent File Used in
SPSS and Leaps and Bounds

These are the variable numbers found on the SPSS output.

 1 UP     2 V      3 W      4 CC     5 %SS    6 PD     7 NF
 8 NB     9 NC    10 SF    11 SB    12 CF    13 CB    14 DIG
15 AN    16 EM    17 PS    18 XMTR  19 BF

          UP     V      W      CC     %SS    PD
NF        20     21     22     23     24     25
NB        26     27     28     29     30     31
NC        32     33     34     35     36     37
SF        38     39     40     41     42     43
SB        44     45     46     47     48     49
CF        50     51     52     53     54     55
CB        56     57     58     59     60     61
DIG       62     63     64     65     66     67
AN        68     69     70     71     72     73
EM        74     75     76     77     78     79
PS        80     81     82     83     84     85
XMTR      86     87     88     89     90     91
BF        92     93     94     95     96     97

98 LSC/OH
Results

First, stepwise regression was used on the 60 variables remaining
after the Chow test. Table VII provides a summary of the independent
variable included at each step of the regression, along with R² and
adjusted R², for the first 48 steps. It can be seen that step 25 shows a
jump of almost 0.03 in R² and more than 0.04 in adjusted R² over the
previous step. The steps that follow provide only small increases in R²
and adjusted R². The adjusted R² hits a peak at step 45 but actually
changes very little between steps 25 and 45. After step 45, adjusted R²
shows a general downward trend except when a variable is removed. As a
result, the equation at step 25 was chosen as the best compromise between
correlation and number of variables included. The equation has 23
variables, which is two more than the Westinghouse analysis, but it also
provides an R² more than 0.02 higher and an adjusted R² about 0.03
higher than they achieved. Table VIII contains the coefficients and the
partial F-values for each variable in the equation.
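
The relationship between R² and adjusted R² can be checked against the early steps reported in Table VII with a short Python sketch. The sample size N = 63 used here is an assumption: it is not stated in this section, but it is the value for which the standard adjustment formula reproduces the tabulated figures for steps 1 to 3. The function name `adjusted_r2` is likewise only illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for k independent variables and n data points."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

N_ASSUMED = 63  # inferred from the table values, not stated in the text

# (step, number of variables K, R-squared, tabulated adjusted R-squared)
table_vii_rows = [
    (1, 1, 0.57312, 0.56612),
    (2, 2, 0.65870, 0.64733),
    (3, 3, 0.72117, 0.70700),
]
recomputed = [round(adjusted_r2(r2, N_ASSUMED, k), 5) for _, k, r2, _ in table_vii_rows]
```

Because the adjustment penalizes each added variable, adjusted R² can fall even while R² rises, which is the behavior the text describes after step 45.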

Next, as a comparison, a stepwise regression was performed
with all 97 variables as inputs, regardless of results of the Chow test.
The first six variables selected were the same as in the previous run,
but from that point the two equations diverged. The regression with
all 97 variables consistently had an R² 0.02 to 0.03 lower than the
previous run for the same number of variables included. This could
be taken to mean that the variables deleted by the Chow test were
Table VII

Sequence of Stepwise Regression on 60 Variables
Remaining After Chow

Step   Variable        R²        Adjusted R²
  1    W               .57312    .56612
  2    AN*UP           .65870    .64733
  3    SB              .72117    .70700
  4    DIG*W           .76054    .74402
  5    NB              .78460    .76571
  6    XMTR*CC         .80354    .78249
  7    DIG*V           .81802    .79486
  8    BF*%SS          .82905    .80372
  9    NB*V            .83850    .81107
 10    AN              .84931    .82034
 11    EM              .86207    .83232
 12    PS              .87267    .84211
 13    BF*W            .87661    .84387
 14    SF              .88006    .84508
 15    UP              .88449    .84822
 16    %SS             .88799    .84903
 17    XMTR*%SS        .89475    .85499
 18    NB*W            .90005    .85917
 19    AN (removed)    .90005    .86229
 20    EM*V            .90507    .86624


indeed not needed and are only clouding the issue when they are
allowed in the set being considered.

Finally, an SPSS run was made on the 35 variables that had the
highest F-values from the Chow test in an attempt to determine
whether the rank of the F-values had any significance. Even with all
35 variables in the equation, R² was only 0.88 and adjusted R² was only
0.73. This indicates that the variables with the highest F-values are
not necessarily the most important in a regression. The ranks of the
F-values should not be given any significance.

Conclusion 

Both  SPSS  and  BMD  are  very  easy  to  use  and  require  little  fore- 
knowledge of  the  basis  of  regression.  Therein  also  lies  a danger  that 
Table VIII (cont'd)

Variable No.   Variable Name   Coefficient   Partial F
    70         AN*W              0.758240       8.11
    75         EM*V              0.422377       8.71
    89         XMTR*CC           0.294839      25.70
    90         XMTR*%SS         -0.456146      24.86
    94         BF*W              0.697895      25.90
    96         BF*%SS           -0.642736      43.88
               Constant         -5.315378      79.01

the packages will be misapplied and their output applied blindly. Both
manuals (Ref 20:320) and (Ref 10:215) contain an introduction to
regression which should be sufficient background for most cases. Of the two
manuals, the SPSS manual is more detailed and easier to follow.


IV.  All Possible Regressions


The number of possible subsets given K possible variables
increases at the rate of 2^K as K increases, and the number of
calculations required to invert the moments matrix for each subset is of the
order K³. By taking advantage of the symmetry of moments matrices
and the deletion of unneeded rows and columns as successive
regressions are calculated, as well as storing moments matrices for later
modification and use, the order of calculations for each subset is
brought down to K². If only the regression coefficients, their
variances, and the sum of squared errors are wanted, the calculations
required are of order K for each subset. If only the sum of squared
errors is needed, the number of calculations is less than six per
subset (Ref 13:500). But even this would mean 9.5×10^29 calculations for
97 variables and 6.4×10^9 calculations for only 30 variables just to
calculate the sums of squared errors for each possible
regression. Clearly some way to eliminate some possibilities before
calculating is needed to make the method feasible for problems of this
magnitude.
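
The two operation counts quoted above follow directly from six calculations for each of the 2^K subsets, as this small Python check shows. The function name and the default of six operations per subset (the upper figure given in the text) are the only assumptions.

```python
def sse_only_operations(k, ops_per_subset=6):
    """Upper bound on calculations to obtain the SSE of all 2**k subsets of k variables."""
    return ops_per_subset * 2 ** k

for k in (30, 97):
    print(k, f"{sse_only_operations(k):.1e}")
# 30 6.4e+09
# 97 9.5e+29
```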

This is what has been developed by Furnival and Wilson (Ref 13).
Their technique, called Leaps and Bounds, is based on the fundamental
fact that SSE(A) ≤ SSE(B), where A is any set of independent variables
and B is any subset of A. It is impossible for any subset of A to have
a lower error sum of squares than A. From this, we can use the SSE
of A as a lower bound on the SSE of its subsets (in the nomenclature of
Furnival and Wilson, the offspring of A). If it is known that regressions
with fewer variables than A that are not offspring of A have a lower SSE
than A, then there is no need to investigate the offspring of A as
possible optimal solutions. In this way, the number of regressions and
calculations is reduced. The amount by which they are reduced is
determined by how early in the branching process good lower bounds are
found. If a good lower bound is located early, most of the regressions
are eliminated from consideration before calculations are performed on them.

It was noted in performing this study that the amount of execution time,
and hence calculations, varied over a wide range for a given number
of variables input. For instance, for the case of 29 independent
variables, execution time ranged from 14 to 110 seconds, depending on
the variables input.
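
The bounding idea can be sketched in Python as a simple depth-first search that drops one variable at a time and skips a branch whenever the SSE of the current set is already no better than the best equations found for every size its offspring could have. This is only an illustration of the SSE(A) ≤ SSE(B) bound under stated assumptions: it is not Furnival and Wilson's algorithm or the RLEAP routine, and the function names and simulated data are hypothetical.

```python
import random
from itertools import combinations

def sse(X, y, cols):
    """Error sum of squares for y regressed on an intercept plus the columns in cols."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    p = len(cols) + 1
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(i, p), key=lambda k: abs(A[k][i]))
        A[i], A[piv] = A[piv], A[i]
        for k in range(p):
            if k != i:
                f = A[k][i] / A[i][i]
                A[k] = [a - f * b for a, b in zip(A[k], A[i])]
    b = [A[i][p] / A[i][i] for i in range(p)]
    return sum((t - sum(c * v for c, v in zip(b, r))) ** 2 for r, t in zip(rows, y))

def best_subsets(X, y, k):
    """Lowest-SSE subset of each size 1..k, pruning with the parent's SSE as a bound."""
    best = {m: (float("inf"), None) for m in range(1, k + 1)}
    evaluated = 0

    def visit(subset, start):
        nonlocal evaluated
        s = sse(X, y, subset)
        evaluated += 1
        m = len(subset)
        if s < best[m][0]:
            best[m] = (s, tuple(subset))
        if m == 1:
            return
        # Offspring keep the first `start` variables, so their sizes run from
        # max(start, 1) to m - 1, and each has SSE >= s. Skip them if none can improve.
        if all(best[j][0] <= s for j in range(max(start, 1), m)):
            return
        for i in range(start, m):
            visit(subset[:i] + subset[i + 1:], i)

    visit(list(range(k)), 0)
    return best, evaluated

# Simulated data (hypothetical) with five candidate variables.
random.seed(3)
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(30)]
y = [2.0 * r[0] - 1.5 * r[3] + random.gauss(0, 1) for r in X]
best, evaluated = best_subsets(X, y, 5)

# Exhaustive search over all subsets, for comparison only.
exhaustive = {m: min(sse(X, y, list(c)) for c in combinations(range(5), m))
              for m in range(1, 6)}
```

How many of the 31 possible regressions are skipped depends on the order in which good bounds are found, which is consistent with the wide range of execution times noted above.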

To  illustrate  the  method,  an  example  from  Furnival  and  Wilson 
(Ref  13:506)  will  be  duplicated  below  in  Figure  1.  There  are  five  inde- 
pendent variables  numbered  1 to  5.  Underlined  variables  are  those 
that  are  going  to  be  removed  in  offspring  equations.  Missing  variables 
indicate  those  that  have  already  been  removed  from  the  branch.  The 
numbers  in  parentheses  are  the  SSE  values  for  the  corresponding  equa- 
tion. For  example,  . 1245  indicates  the  subset  of  variables  containing 
horizontally, vertically or a combination of those. The simplest for
illustrative purposes is the horizontal method, whereby equations with
the same number of variables are considered together. The process
begins by calculating the SSE for each of the four-variable equations.
Then the SSE of equation .123 is calculated as 612. If 612 were lower
than any of the SSEs from the four-variable equations, no three-variable
offspring of that four-variable equation would be considered or
calculated. Note that equation .125 has an SSE of 597, which is lower
than the SSE of 605 for equation .1345 or the SSE of 660 for equation
.2345. Therefore, there is no need to calculate the SSE for equations
.134, .135, .145, .234, .235, .245, or .345.

Next, begin to evaluate the two-variable equations and compare
them to the three- and four-variable equations not already eliminated.
Note that the SSE of equation .12, 615, is less than that for equation
.2345. Therefore, there is no need to consider two-variable
offspring of equation .2345. We need calculate the SSE only for equations
.12, .13, .14, and .15.

Finally, begin to evaluate the one-variable models and compare
them to the two-, three-, and four-variable models not already
eliminated. If an SSE of a one-variable model is found to be less than that
of a two-, three-, or four-variable equation, then their one-variable
offspring need not be considered. In this case, all of the one-variable
equations must be considered, and equation .1 is found to be the best
one-variable model.

In this example, only 17 values of SSE were calculated to
guarantee that the best one-, two-, three-, and four-variable equations were
found. While the traverse used here is not as efficient as that used
in the version of Leaps and Bounds in the International Mathematical
and Statistical Libraries (IMSL), it is sufficient for illustrative
purposes. The IMSL subroutine, called RLEAP, provides as output the
m best of each of the 1- to K-variable subsets, where the user
supplies the value of m. The output can be based on any or all of the
criteria R², adjusted R², or Cp. The LLSCFP also uses a search
technique based on Cp, but the version available for this study allowed
only 29 variables to be input at a time and, of these, only 12 at a time
could be searched. This would cause the problem to become too
fragmented for the number of variables that had to be considered. As a
result, the LLSCFP was not used.


Application 

It was found that the amount of execution time used by RLEAP
grew very quickly above 20 variables. At 20 variables, execution time
was only a few seconds on the CDC 6600 series computer. But at 29
variables, the execution time could go as high as 138 seconds octal,
and at 35 variables over 1,000 seconds octal would be required. As a
result, 29 variables were settled on as a practical maximum input to
RLEAP for this study.

Because there were 60 variables remaining from the Chow test,
it was necessary to consider them in groups of no more than 29 and
then to combine the optimum answers from each group into a new set.
The new subset formed in this way was then run through RLEAP to
pick the optimum subset from it. The result should then be a
relatively small number of variables which can be used as a core to rotate
the remaining variables through RLEAP again. The program used
provided as output, for each criterion, the two best sets of variables for
each equation size from 1 to the number of input variables, and the
coefficients of the best subset. The procedure was begun by dividing
the 60 variables into 3 groups of 20 and using RLEAP on each group.
The three groups are shown in Table IX. Both adjusted R² and Cp
statistics were used to search for best subsets. It was noted that the Cp
statistic selected an equation of fewer variables than did the adjusted R²
criterion in most cases. In no case did the Cp criterion select an
equation of more variables than did the adjusted R² criterion.


Table IX

Group 1     Group 2     Group 3
UP          NF*V        DIG*%SS
V           NF*W        DIG*PD
W           NF*CC       AN*UP
CC          NF*PD       AN*V
%SS         NB*UP       AN*W
PD          NB*V        EM*UP
NF          NB*W        EM*V
NB          NB*CC       EM*W
NC          NB*PD       EM*CC
SF          NC*UP       EM*PD
SB          NC*V        XMTR*UP
CF          NC*W        XMTR*V
CB          SF*V        XMTR*W
DIG         SF*W        XMTR*CC
AN          SF*CC       XMTR*%SS
EM          SF*%SS      XMTR*PD
PS          DIG*UP      BF*UP
XMTR        DIG*V       BF*V
BF          DIG*W       BF*W
NF*UP       DIG*CC      BF*%SS

Table X

Subset of Variables of Three Groups Selected by RLEAP

Group 1       Group 2       Group 3
UP            NF*W          DIG*%SS
W             NF*CC         DIG*PD
CC            NB*V          AN*W
NB            NB*W          EM*UP
SF            NC*V          EM*V
SB            NC*W          EM*W
DIG           SF*V          EM*PD
XMTR          SF*W          DIG*W

Cp = 5.909    Cp = 2.958    Cp = 3.147

RLEAP  to  pick  a smaller  subset.  The  subset  selected  by  this  run  is 
shown  in  Table  XI  below. 

These 13 variables were then used as a core, to which all of the
remaining variables were added in three groups, giving every variable
a second chance to enter the equation. The new groups for this third
run were formed by first including the 13 variables from the second run
and then adding 15 or 16 of the 47 remaining variables until all were
included in one group. Three groups of 28, 29, and 29 variables
Table XI

Subset Selected from 27 Variables Combined
from Run 1 Based on Cp Criterion

UP        NF*CC       EM*PD
W         DIG*%SS     BF*W
SF        DIG*PD      BF*%SS
SB        AN*W
DIG       EM*W

Adjusted R² = .83        Cp = 12.638

resulted. When RLEAP was used on these three groups, the subsets
selected by the Cp criterion, shown in Table XII, resulted. Asterisks
indicate that the variable is one of the 13 identified by the previous run
and input into all three groups. If the number of asterisks in a column
is near 13, it indicates that the additional variables had little to offer
in improving the model and that the model input from the previous run
was relatively stable. If there are few asterisks, the additional
variables had a lot to offer the model.
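
The way the third-run groups were assembled (a fixed core of 13 variables plus 15 or 16 of the 47 remaining variables per group) can be sketched in Python as follows. The function name and the use of integer indices in place of variable names are assumptions for illustration; which of the remaining variables went into which group is not specified here.

```python
def core_plus_chunks(core, remaining, n_groups):
    """Split `remaining` into n_groups nearly equal chunks and prepend `core` to each."""
    base, extra = divmod(len(remaining), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        # The last `extra` groups take one additional variable.
        size = base + (1 if g >= n_groups - extra else 0)
        groups.append(list(core) + list(remaining[start:start + size]))
        start += size
    return groups

core = list(range(13))            # stand-ins for the 13 variables from Run 2
remaining = list(range(13, 60))   # stand-ins for the other 47 variables
groups = core_plus_chunks(core, remaining, 3)
sizes = [len(g) for g in groups]  # [28, 29, 29]
```

Keeping every group at or below 29 variables respects the practical RLEAP input limit adopted for this study.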

Until this point, the Cp statistic has been used as the
criterion for choosing the best subset because the objective has been to
eliminate a large number of variables quickly. But the subsets from
Run 3 are small enough, and the goodness-of-fit good enough, for both
the adjusted R² and Cp criteria that both deserved further analysis. Table XIII
Table XII

Three Subsets Resulting from the Third Run
Based on Cp Statistic

Group 1               Group 2               Group 3
W *                   UP *                  UP *
CC                    W *                   W *
%SS                   SF *                  SF *
NB                    SB *                  SB *
SF *                  NF*W                  NF*CC *
SB *                  NF*CC *               DIG*CC
NF*UP                 NF*PD                 DIG*PD *
NF*CC *               NC*UP                 AN*UP
DIG*PD *              NC*V                  EM*W *
BF*W *                DIG*UP                EM*PD *
BF*%SS *              DIG*PD *              BF*W *
                      AN*W *                BF*%SS *
                      EM*W *
                      EM*PD *
                      BF*W *
                      BF*%SS *

Cp = 6.802            Cp = 12.289           Cp = 4.193
Adjusted R² = .8512   Adjusted R² = .88     Adjusted R² = .8436

* indicates the variable remains from the 13 from Run 2.
Table XIII

Three Subsets Resulting from the Third Run
Based on Adjusted R² Criterion

Group 1               Group 2               Group 3
UP *                  UP *                  UP *
W *                   W *                   W *
CC                    SF *                  SF *
%SS                   SB *                  SB *
NB                    DIG *                 DIG *
SF *                  NF*W                  NF*CC *
SB *                  NF*CC *               DIG*V
DIG *                 NF*PD                 DIG*W
EM                    NC*UP                 DIG*%SS *
XMTR                  NC*V                  AN*W *
NF*UP                 SF*CC                 EM*W *
NF*CC *               DIG*UP                EM*PD *
DIG*%SS *             DIG*%SS *             BF*W *
DIG*PD *              DIG*PD *              BF*%SS *
EM*W *                AN*W *
BF*W *                EM*W *
BF*%SS *              EM*PD *
                      BF*W *
                      BF*%SS *

Cp = 9.632            Cp = 12.52            Cp = 6.0
Adjusted R² = .8651   Adjusted R² = .8954   Adjusted R² = .8466

* indicates the variable remains from the 13 from Run 2.



contains the three sets of variables selected in Run 3 by RLEAP using
the adjusted R² criterion. The Cp values for the six sets in Tables XII
and XIII cannot be compared to each other because they have not been
standardized as explained in Chapter I. But this will not cause a problem,
because the adjusted R² has a standard basis for all cases and because the
sets are going to be run through RLEAP again in various combinations,
as indicated in Figure 2. On the upper branch after Run 2 are the
sets selected in Run 3 by the Cp statistic. Group 2 and Group 3 on
this branch will be combined in Run 4 because they are similar. The
set resulting from Run 4 will then be combined in Run 5 with the set
in Group 1. By that point, the model should have stabilized if it is going
to. The same scheme will be used on the adjusted R² branch of Run 3.

Table XIV shows the variables selected in each group and each
run after the branch to Run 3. Figure 2 shows the branching with the
results at each node in terms of number of variables, Cp value, and
adjusted R². It can be seen in Table XIV that in both the Cp and
adjusted R² branches Group 2 dominates the model. In the Cp branch, the
sets selected in Run 4 (combining Groups 2 and 3) and Run 5 (combining
Group 1 and the results from Run 4) are identical to Group 2. Apparently,
then, Group 2 is the optimum subset of the variables in Run 3 from a Cp
standpoint.

On the adjusted R² branch, Groups 2 and 3 combined in Run 4 to form a
subset similar to Group 2. When Group 1 was combined in Run 5 with
the set resulting from Run 4, the set selected was identical to the


Figure  Z.  Sequence  of  Leaps  and  Bounds  Runs 
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Run  2 Run  3 Run  4 

Figure  3.  Results  of  Leaps  and  Bounds  Runs 


result  from  Run  4.  Then  the  set  from  Run  4 must  the  optimal  subset 
of  the  variables  in  Run  3 from  a standpoint. 

The two equations found by the Leaps and Bounds method are
described in Table XV, where the coefficients and partial F-values
are given. It can be seen that the set generated using the Cp criterion
is a subset of that generated by the adjusted R2 criterion. The decision
as to which equation is better will be deferred until Chapter V, when
the SPSS generated equation will also be considered.

One thing to keep in mind is, as Aitken said (Ref 1:226), "There
may be no one 'best' equation, only a most appropriate one of several
adequate equations."


Table XV


Equations Selected Using the All-Possible
Regressions Method


$$
\ln(\mathrm{LSC/OH}) = \alpha_0 + \sum_i \alpha_i D_i + \sum_j \beta_{j0} \ln x_j + \sum_j \sum_i \beta_{ji} D_i \ln x_j
$$


              Cp Criterion               Adjusted R2 Criterion
              R2 = 0.9135                R2 = 0.9323
              Adjusted R2 = 0.88347      Adjusted R2 = 0.9001
              F = 31.21                  F = 29.25

Variable      Coefficient  Partial F     Coefficient  Partial F
UP              0.245908      8.78         0.313871     14.52
W               0.384075      7.75         0.350494      6.86
SF             -1.061926     12.78        -2.878942     14.29
SB             -1.822390     30.26        -2.195891     39.06
DIG                --          --          4.381530      4.88
NF*W           -0.431742     31.61        -0.343076      2.10
NF*CC          -0.466254     13.70        -0.470354     15.84
NF*PD           0.738901     16.62         0.672722     14.59
NC*UP           0.285409      5.13         0.254284      4.04
NC*V           -0.334677      4.93        -0.292486      3.92
SF*CC              --          --          0.293229   (illegible)
DIG*UP         -0.584870     12.86        -0.950128     11.70
DIG*V              --          --         -0.971576      2.25
DIG*W              --          --          2.676919      4.93
DIG*PD          1.081951     15.97         0.553008      2.59
AN*W            0.309271     16.60         0.239272      9.98
EM*W            0.698175     13.89         0.705835     13.47
EM*PD          -0.555855     21.58        -0.545678     20.61
BF*W            0.866668     28.67         0.828916     27.04
BF*%SS         -0.701034     37.03        -0.706378     38.19
Constant       -3.855040     53.44        -4.091618     64.16


All other coefficients are zero.


V.  Conclusions 


Results 

Three equations have been found by the stepwise regression and
all-possible-regressions methods. They differ somewhat but have some
terms in common and have similar performance characteristics. To
aid in making a choice among the three, the variables contained in
each and the pertinent statistics are summarized in Table XVI. The
stepwise result has the highest correlation coefficients and overall F
value, but it also has the most variables. Recall from Table VII that
when the stepwise equation had 20 variables, as does the Leaps and
Bounds adjusted R2 equation, its R2 was only 0.913 and its adjusted R2
only 0.872. These values are somewhat lower than those for the Leaps
and Bounds equations. This makes it appear that the Leaps and Bounds
algorithm yields more efficient equations.
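
The R2 and adjusted R2 values quoted for the equations can be related through the usual degrees-of-freedom adjustment. The following is a minimal sketch of that check, not part of the original analysis. It assumes N = 63 observations, a figure inferred from the loop bounds in the appendix listings rather than stated in this chapter, and the function name is illustrative only.

```python
def adjusted_r2(r2, n_obs, n_vars):
    """Degrees-of-freedom adjustment: 1 - (1 - R2)(N - 1)/(N - K - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_vars - 1)


# Assumption: 63 observations, inferred from the data-reading loops
# in the appendix programs, not stated in this chapter.
N_OBS = 63

# (label, reported R2, number of variables K)
equations = [
    ("SPSS stepwise", 0.95212, 23),
    ("L&B Cp", 0.9135, 16),
    ("L&B adjusted R2", 0.9323, 20),
    ("Stepwise at 20 variables", 0.913, 20),
]

for label, r2, k in equations:
    print(f"{label}: adjusted R2 = {adjusted_r2(r2, N_OBS, k):.4f}")
```

Under that assumption the computed values agree with the reported adjusted R2 figures (0.9239, 0.883, 0.9001, and 0.872) to about three decimal places.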

Validation 

Another way to distinguish among the equations is to apply a
validation procedure to all of them. Six data points on LRUs were not
used in finding the models; the equations will be tried against these
points to see how well they predict. The data points are shown in
Table XVII.
Table XVI


Comparison of Three Regressions


Variable        SPSS        L&B Cp      L&B Adjusted R2
UP              0.4027      0.2459      0.3139
W               0.0845      0.3841      0.3505
%SS             0.4124        --          --
NB             11.3207        --          --
SF             -1.1354     -1.0619     -2.8789
SB             -1.4578     -1.8224     -2.1959
DIG             3.7105        --        4.3815
EM             -2.9510        --          --
PS             -0.0927        --          --
NF*UP           0.3220        --          --
NF*W              --       -0.4317     -0.3431
NF*CC          -0.5681     -0.4662     -0.4704
NF*PD             --        0.7389      0.6727
NB*UP          -0.7298        --          --
NB*V           -1.8032        --          --
NB*W            2.5068        --          --
NC*UP             --        0.2854      0.2543
NC*V              --       -0.3347     -0.2925
SF*CC             --          --        0.2932
DIG*UP            --       -0.5849     -0.9501
DIG*V          -1.9960        --       -0.9718
DIG*W           3.0350        --        2.6769
DIG*PD            --        1.0820      0.5530
AN*UP          -0.2721        --          --
AN*W            0.7582      0.3093      0.2393
EM*V            0.4224        --          --
EM*W              --        0.6982      0.7058
EM*PD             --       -0.5558     -0.5457
XMTR*CC         0.2948        --          --
XMTR*%SS       -0.4561        --          --
BF*W            0.6979      0.8667      0.8289
BF*%SS         -0.6427     -0.7010     -0.7064
Constant       -5.3154     -3.8550     -4.0916

R2              0.95212     0.9135      0.9323
Adjusted R2     0.9239      0.8835      0.9001
F              33.72       31.21       29.25


Table XVII


Data Points Used for Validation


Aircraft     F4E      F4E      F4E      F4E      F111     F111
UP           8046     3398     9831     9910     5514     6650
V            585.2    323.7    562.7    776.5    1025.2   738.7
W            12.5     9.3      11.8     40.8     43.3     19.0
CC           878      58       209      73       1379     900
%SS          87       100      77       69       78       100
PD           58       75       24       1800     311.6    253.8
DIG          0        0        0        0        0        0
ANALOG       1        0        1        1        1        1
EM           1        0        1        1        0        0
PS           0        1        0        0        0        0
XMTR         0        0        0        0        1        1
BF           0        0        1        1        0        0
TYPE USE     SF       SF       SF       SF       NF       SF
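
As an illustration of how a point estimate follows from the model form in Table XV, the sketch below evaluates the Cp-criterion equation for the first validation data point in Table XVII. It is a hedged reading, not the original computation: the dictionary layout and function name are hypothetical, natural logarithms are assumed, the ANALOG row is taken to correspond to the AN indicator, and the TYPE USE entry (SF) is taken to set the SF indicator to 1 with all other usage indicators 0.

```python
import math

# Cp-criterion coefficients from Table XV, keyed by (indicator, variable).
# indicator None: the term multiplies ln(variable) for every unit.
# variable None: the term is a pure indicator (no logarithm).
CP_COEFFS = {
    (None, "UP"): 0.245908,
    (None, "W"): 0.384075,
    ("SF", None): -1.061926,
    ("SB", None): -1.822390,
    ("NF", "W"): -0.431742,
    ("NF", "CC"): -0.466254,
    ("NF", "PD"): 0.738901,
    ("NC", "UP"): 0.285409,
    ("NC", "V"): -0.334677,
    ("DIG", "UP"): -0.584870,
    ("DIG", "PD"): 1.081951,
    ("AN", "W"): 0.309271,
    ("EM", "W"): 0.698175,
    ("EM", "PD"): -0.555855,
    ("BF", "W"): 0.866668,
    ("BF", "%SS"): -0.701034,
}
CP_CONSTANT = -3.855040


def predict_lsc_per_oh(coeffs, constant, values, indicators):
    """Evaluate ln(LSC/OH) term by term and return LSC/OH in dollars."""
    total = constant
    for (ind, var), beta in coeffs.items():
        d = 1.0 if ind is None else float(indicators.get(ind, 0))
        x = 1.0 if var is None else math.log(values[var])
        total += beta * d * x
    return math.exp(total)


# First column of Table XVII (an F4E unit in usage category SF).
lru1_values = {"UP": 8046, "V": 585.2, "W": 12.5,
               "CC": 878, "%SS": 87, "PD": 58}
lru1_indicators = {"SF": 1, "DIG": 0, "AN": 1, "EM": 1,
                   "PS": 0, "XMTR": 0, "BF": 0}

estimate = predict_lsc_per_oh(CP_COEFFS, CP_CONSTANT,
                              lru1_values, lru1_indicators)
print(f"Cp equation point estimate: {estimate:.4f} dollars per operating hour")
```

Under these assumptions the result is close to the Cp point estimate of .2350 listed for the first LRU in Table XVIII, which suggests the reading of the indicator terms is reasonable.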

While we could get point estimates by plugging the data points into
the regression equations we have generated, we would have no feel
for the variability of our estimates. Prediction intervals would give
a level of confidence about the estimates. None of the packages
mentioned up to this point readily provides the data needed to
calculate prediction intervals, but another simple statistics package,
OMNITAB, written by the National Bureau of Standards, does. The
formula used to calculate prediction intervals in this case is


$$
PI = \hat{y} \pm t_{\alpha/2,\,N-K-1} \sqrt{(SDPV)^2 + S^2} \qquad (57)
$$


where

$\hat{y}$ is the point estimate,

SDPV is the Standard Deviation of the Predicted Value from
OMNITAB, and

S is the residual standard deviation.

All  three  of  these  values  can  be  read  directly  from  the  OMNITAB 
output  making  the  determination  of  prediction  intervals  quite  easy. 

A listing  of  the  OMNITAB  output  can  be  found  in  the  Appendix  in 
Figure  10.  The  OMNITAB  package  (Ref  18)  will  not  perform  any 
selection  of  variables,  but  once  a subset  has  been  decided  on,  it  will 
provide  useful  information  such  as  plots  and  the  standard  deviation 
measures  needed  to  calculate  the  prediction  intervals. 
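
A minimal sketch of the interval calculation in Eq (57) follows. The input numbers are hypothetical placeholders, not OMNITAB output, and the t value is supplied by hand from a t table. Because the dependent variable of the models is ln(LSC/OH), the sketch also assumes that the interval is formed on the logarithmic scale and then exponentiated to obtain dollars per operating hour; that step is an inference from the model form rather than something stated with the formula.

```python
import math


def log_prediction_interval(y_hat, sdpv, s, t_crit):
    """Return (lower, upper) for y_hat +/- t * sqrt(SDPV^2 + S^2)."""
    half_width = t_crit * math.sqrt(sdpv ** 2 + s ** 2)
    return y_hat - half_width, y_hat + half_width


def to_dollars(*log_values):
    """Convert values of ln(LSC/OH) back to LSC/OH."""
    return tuple(math.exp(v) for v in log_values)


# Hypothetical inputs for illustration only.
y_hat = math.log(0.25)   # point estimate of ln(LSC/OH)
sdpv = 0.3               # standard deviation of the predicted value
s = 0.4                  # residual standard deviation
t_crit = 1.68            # t-table value chosen for a 90% interval

log_lower, log_upper = log_prediction_interval(y_hat, sdpv, s, t_crit)
lower, point, upper = to_dollars(log_lower, y_hat, log_upper)
print(f"lower = {lower:.4f}, point = {point:.4f}, upper = {upper:.4f}")
```

An interval built this way is symmetric about the point estimate on the logarithmic scale, so in dollars the upper and lower limits are the point estimate multiplied and divided by the same factor.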

Table XVIII contains the 90% prediction intervals of LSC/OH, in
dollars per operating hour, generated by each of the three equations.

Table XVIII

Prediction Intervals Generated by the Stepwise and the Leaps and Bounds Equations


LRU                      1        2        3        4        5        6

Actual                 .589     .111     .513      --     2.026    1.093

L&B Lower Cp          .0868    .0478    .0540    .0816    .4033    .1797
L&B Cp                .2350    .1273    .1504    .2432   1.1359    .4912
L&B Upper Cp          .6360    .3392    .4187    .7248   3.1992   1.3427

L&B Lower Adj. R2     .1253    .0328    .0477    .0643    .3786    .2376
L&B Adj. R2           .3314    .0866    .1252    .1916   1.0043    .6209
L&B Upper Adj. R2     .8765    .2284    .3288    .5709   2.664    1.6222

SPSS Lower            .0906    .1224    .0172    .2185    .2730    .1523
SPSS                  .2089    .3068    .0431    .5218    .6658    .3619
SPSS Upper            .4818    .7689    .1079   1.247    1.6242    .8598

Westinghouse          .2952    .0909    .1313      --    1.2879   1.0418


The most noticeable result of comparing prediction intervals is
that the SPSS equation generates 90% intervals that contain only one of
the six actual values, while the intervals generated by each of the
Leaps and Bounds equations contain five of the six actual values. On
this basis, either of the Leaps and Bounds equations appears to be a
better predictor than the SPSS equation. A closer look is required to
choose between the two Leaps and Bounds equations, though.
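
The coverage comparison can be tallied directly from the entries of Table XVIII. The sketch below is illustrative only: the data structures and function name are hypothetical, and LRU 4 is left out because its actual value is not available here, so the counts are out of five rather than six.

```python
# Actual LSC/OH values from Table XVIII (LRU 4 omitted: value unavailable).
ACTUAL = {1: 0.589, 2: 0.111, 3: 0.513, 5: 2.026, 6: 1.093}

# (lower, upper) 90% prediction limits from Table XVIII, keyed by LRU.
INTERVALS = {
    "L&B Cp": {
        1: (0.0868, 0.6360), 2: (0.0478, 0.3392), 3: (0.0540, 0.4187),
        5: (0.4033, 3.1992), 6: (0.1797, 1.3427),
    },
    "L&B adjusted R2": {
        1: (0.1253, 0.8765), 2: (0.0328, 0.2284), 3: (0.0477, 0.3288),
        5: (0.3786, 2.664), 6: (0.2376, 1.6222),
    },
    "SPSS": {
        1: (0.0906, 0.4818), 2: (0.1224, 0.7689), 3: (0.0172, 0.1079),
        5: (0.2730, 1.6242), 6: (0.1523, 0.8598),
    },
}


def coverage(intervals, actual):
    """Count the LRUs whose actual value lies inside its interval."""
    return sum(1 for lru, (lo, hi) in intervals.items()
               if lo <= actual[lru] <= hi)


counts = {name: coverage(iv, ACTUAL) for name, iv in INTERVALS.items()}
for name, hits in counts.items():
    print(f"{name}: {hits} of {len(ACTUAL)} actual values inside the interval")
```

On the five LRUs with a known actual value, each Leaps and Bounds equation covers four and the SPSS equation covers none, which is consistent with the five-of-six and one-of-six counts in the text if the LRU 4 value falls inside all three intervals.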


First, the distance of the point estimate of each from the actual
value was considered. In four cases, LRUs 1, 3, 4, and 5, the equation
based on the Cp criterion had the point estimate nearest the actual
value. Then the width of the 90% prediction interval was compared
between the Cp based and adjusted R2 based equations. Each was found
to have the narrower prediction interval three times.

The Cp based equation appears to be the best predictor of those
considered, based on generating the smallest error from actual values
and its smaller size, 16 variables versus 20 or 23. This equation
came closer than the Westinghouse equation to the actual values in two
of the six cases, but Westinghouse did not generate prediction intervals
to compare against. Two words of caution are necessary, though. First,
the confidence levels used here cannot be guaranteed to be known
exactly; refer to page 28 for the discussion of α levels. Second,
the six LRUs used for validation were all on fighter aircraft in usage
category SF or NF, and none were digital.

But with the information available, it appears that the personnel
of the Air Force Avionics Laboratory could perform this same type of
analysis, first using the Chow test to pre-select variables for
elimination, then applying the Leaps and Bounds method described in
Chapter IV. The data collection is the most time-consuming function and
could still be contracted out. Computer analysis of the data as was
done in this report could be done for less than $100 of computer time
using packages already existing on the CDC system at Wright-Patterson
AFB.

Recommendations 

At  the  present  time,  the  Westinghouse  Electric  Corporation  is 
in  the  process  of  enlarging  the  data  base  and  performing  another 
analysis.  In  progress  reports  to  the  Avionics  Laboratory,  they  have 
mentioned  the  possibility  of  finding  prediction  intervals  which  could 
be compared to those generated for this report. In any case, the
personnel of the Avionics Laboratory could use the enlarged data base
to perform their own analysis as described in Chapter IV and, if results
are suitable, consider generating their own cost estimating
relationships on future data.

All of the methods used in this report are based on minimizing
the sum of squared errors. Other criteria for finding optimal subsets
have been discussed in the literature in recent years but have not yet
been packaged for easy use. These include Mean Square Error of
Prediction (Ref 2:469) and (Ref 6:46) and Average Estimated Variance
(Ref 15:261). As these methods become proven, they should be considered
for use on the data base, as they place more emphasis on prediction
than do the criteria used in this study.


      PROGRAM YSEPR(INPUT,OUTPUT,TAPE1,TAPE2)
*     ... THE Y1 AND Y2 MATRICES USED IN ...
*     TAPE1 IS INPUT DATA ... THE DEPENDENT ...
*     ... VARIABLES, ... IN THAT ORDER
*     FORMAT IS 4F20.14
*     TAPE2 IS OUTPUT
*     THE SECOND COUNTER IS READ HERE AS IM (LISTING UNCLEAR)
      DIMENSION X(63,20),Y1(63,13),Y2(63,13)
      DO 1 K=1,63
      READ(1,1000) (X(K,J),J=1,20)
    1 CONTINUE
      DO 2 JJ=1,13
      IN=0
      IM=0
      JD=JJ+6
      DO 3 K=1,63
      IF(X(K,JD).EQ.1.)GO TO 100
      IN=IN+1
      Y1(IN,JJ)=X(K,20)
      GO TO 3
  100 IM=IM+1
      Y2(IM,JJ)=X(K,20)
    3 CONTINUE
      PRINT 2002,IN
      WRITE(2,2002)IN
      PRINT 2000,(INC,JJ,Y1(INC,JJ),INC=1,IN)
      WRITE(2,2000)(INC,JJ,Y1(INC,JJ),INC=1,IN)
      PRINT 2003,IM
      PRINT 2001,(INC,JJ,Y2(INC,JJ),INC=1,IM)
      WRITE(2,2003)IM
      WRITE(2,2001)(INC,JJ,Y2(INC,JJ),INC=1,IM)
    2 CONTINUE
 1000 FORMAT(4F20.14)
 2000 FORMAT(3(5X,*Y1(*,2I2,*)=*,F11.8))
 2001 FORMAT(3(5X,*Y2(*,2I2,*)=*,F11.8))
 2002 FORMAT(1X,*N=*,I2)
 2003 FORMAT(1X,*N=*,I2)
      STOP
      END


Figure 4. Program YSEPR
      PROGRAM SUMS(INPUT,OUTPUT,TAPE1,TAPE2)
*     ... CROSSPRODUCTS NEEDED FOR PROGRAM CHOW
*     TAPE1 IS INPUT DATA ...
*     FORMAT IS 5(4F20.14)
*     TAPE2 IS OUTPUT
      DIMENSION X(20),SS0(6,6,13),SS1(6,6,13),S0(6,13),S1(6,13)
      DO 7001 JD=7,19
      JJ=JD-6
      DO 7002 J1=1,6
      S0(J1,JJ)=0.
      S1(J1,JJ)=0.
      DO 7003 J2=1,6
      SS0(J1,J2,JJ)=0.
      SS1(J1,J2,JJ)=0.
 7003 CONTINUE
 7002 CONTINUE
 7001 CONTINUE
 1001 FORMAT(4F20.14/4F20.14/4F20.14/4F20.14/4F20.14)
      PRINT 8000
 8000 FORMAT(*COMPLETED MODULE 1*)
      DO 1 J=1,63
      READ(1,1001)(X(N),N=1,20)
      DO 2 JD=7,19
      JJ=JD-6
      IF(X(JD).EQ.1.)GO TO 4
      DO 3 J1=1,6
      S0(J1,JJ)=S0(J1,JJ)+X(J1)
      DO 3 J2=J1,6
      SS0(J1,J2,JJ)=SS0(J1,J2,JJ)+(X(J1)*X(J2))
    3 CONTINUE
      GO TO 2
    4 DO 5 J1=1,6
      S1(J1,JJ)=S1(J1,JJ)+X(J1)
      DO 5 J2=J1,6
      SS1(J1,J2,JJ)=SS1(J1,J2,JJ)+(X(J1)*X(J2))
    5 CONTINUE
    2 CONTINUE
    1 CONTINUE
      PRINT 8001
 8001 FORMAT(*COMPLETED MODULE 2*)
      DO 6 JD=7,19
      JJ=JD-6
      DO 7 J1=1,5
      JF=J1+1
      DO 8 J2=JF,6
      SS0(J2,J1,JJ)=SS0(J1,J2,JJ)
      SS1(J2,J1,JJ)=SS1(J1,J2,JJ)
    8 CONTINUE
    7 CONTINUE
    6 CONTINUE
      PRINT 8002
 8002 FORMAT(*MODULE 3 COMPLETED*)
      DO 9 IDUM=1,2
      ITEM=IDUM-1
      DO 10 J1=1,6
      DO 11 J2=1,6
      IF(ITEM.EQ.1)GO TO 199
      PRINT 2001,(ITEM,J1,J2,JJ,SS0(J1,J2,JJ),JJ=1,13)
 2001 FORMAT(3(1X,*SS*,I1,*(*,2I1,I2,*)=*,F16.10)/,
     1       3(1X,*SS*,I1,*(*,2I1,I2,*)=*,F16.10)/,
     2       3(1X,*SS*,I1,*(*,2I1,I2,*)=*,F16.10)/,
     3       3(1X,*SS*,I1,*(*,2I1,I2,*)=*,F16.10)/,
     4       1X,*SS*,I1,*(*,2I1,I2,*)=*,F16.10)
      WRITE(2,2001)(ITEM,J1,J2,JJ,SS0(J1,J2,JJ),JJ=1,13)
      IF(ITEM.EQ.0)GO TO 11
  199 PRINT 2001,(ITEM,J1,J2,JJ,SS1(J1,J2,JJ),JJ=1,13)
      WRITE(2,2001)(ITEM,J1,J2,JJ,SS1(J1,J2,JJ),JJ=1,13)
   11 CONTINUE
   10 CONTINUE
    9 CONTINUE
      DO 12 IDUM=1,2
      ITEM=IDUM-1
      DO 13 J1=1,6
      IF(ITEM.EQ.1)GO TO 299
      PRINT 2002,(ITEM,J1,JJ,S0(J1,JJ),JJ=1,13)
 2002 FORMAT(3(1X,*S*,I1,*(*,I1,I2,*)=*,F16.10)/,
     1       3(1X,*S*,I1,*(*,I1,I2,*)=*,F16.10)/,
     2       3(1X,*S*,I1,*(*,I1,I2,*)=*,F16.10)/,
     3       3(1X,*S*,I1,*(*,I1,I2,*)=*,F16.10)/,
     4       1X,*S*,I1,*(*,I1,I2,*)=*,F16.10)
      WRITE(2,2002)(ITEM,J1,JJ,S0(J1,JJ),JJ=1,13)
      IF(ITEM.EQ.0)GO TO 13
  299 PRINT 2002,(ITEM,J1,JJ,S1(J1,JJ),JJ=1,13)
      WRITE(2,2002)(ITEM,J1,JJ,S1(J1,JJ),JJ=1,13)
   13 CONTINUE
   12 CONTINUE
      PRINT 8003
 8003 FORMAT(*COMPLETED MODULE 4*)
      STOP
      END


Figure 5. Program SUMS
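
For readers unfamiliar with the FORTRAN idiom, the accumulation performed by Program SUMS can be sketched as follows. This is a hedged reading of the listing, not a transcription: it assumes each record holds six independent variables, then 13 indicator columns, then the dependent variable, and that for every indicator the records are split into a group with the indicator equal to 0 and a group with it equal to 1. The function and parameter names are hypothetical. Unlike the listing, which fills the upper triangle and then mirrors it, the sketch accumulates the full symmetric cross-product matrix directly.

```python
def group_sums(rows, n_vars=6, first_indicator=6, n_indicators=13):
    """Accumulate sums and cross products of the first n_vars columns,
    separately for indicator = 0 and indicator = 1, per indicator column.

    Returns (s, ss) where s[g][jj][j1] is a sum and
    ss[g][jj][j1][j2] is a sum of cross products, with g the group
    (0 or 1) and jj the zero-based indicator index.
    """
    s = [[[0.0] * n_vars for _ in range(n_indicators)] for _ in range(2)]
    ss = [[[[0.0] * n_vars for _ in range(n_vars)]
           for _ in range(n_indicators)] for _ in range(2)]
    for x in rows:
        for jj in range(n_indicators):
            g = 1 if x[first_indicator + jj] == 1 else 0
            for j1 in range(n_vars):
                s[g][jj][j1] += x[j1]
                for j2 in range(n_vars):
                    ss[g][jj][j1][j2] += x[j1] * x[j2]
    return s, ss


# Tiny illustrative data set: two variables, one indicator, one response.
demo_rows = [
    [1.0, 2.0, 0, 9.0],
    [3.0, 4.0, 1, 9.0],
]
demo_s, demo_ss = group_sums(demo_rows, n_vars=2,
                             first_indicator=2, n_indicators=1)
print(demo_s[0][0], demo_s[1][0])
print(demo_ss[0][0], demo_ss[1][0])
```

With the default arguments the function expects 20-value records laid out as in the listing's READ statement.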
Figure 6. Program CHOW